Essential Mathematics for Political and Social Research

More than ever before, modern social scientists require a basic level of mathematical literacy, yet many students receive only limited mathematical training prior to beginning their research careers. This textbook addresses this dilemma by offering a comprehensive, unified introduction to the essential mathematics of social science. Throughout the book the presentation builds from first principles and eschews unnecessary complexity. Most importantly, the discussion is thoroughly and consistently anchored in real social science applications, with more than 80 research-based illustrations woven into the text and featured in end-of-chapter exercises. Students and researchers alike will find this first-of-its-kind volume to be an invaluable resource.

Jeff Gill is Associate Professor of Political Science at the University of California, Davis. His primary research applies Bayesian modeling and data analysis to substantive questions in voting, public policy, budgeting, bureaucracy, and Congress. He is currently working in areas of Markov chain Monte Carlo theory. His work has appeared in journals such as the Journal of Politics, Political Analysis, Electoral Studies, Statistical Science, Sociological Methods and Research, Public Administration Review, and Political Research Quarterly. He is the author or coauthor of several books including Bayesian Methods: A Social and Behavioral Sciences Approach (2002), Numerical Issues in Statistical Computing for the Social Scientist (2003), and Generalized Linear Models: A Unified Approach (2000).

Analytical Methods for Social Research
Analytical Methods for Social Research presents texts on empirical and formal methods for the social sciences. Volumes in the series address both the theoretical underpinnings of analytical techniques and their application in social research. Some series volumes are broad in scope, cutting across a number of disciplines. Others focus mainly on methodological applications within specific fields such as political science, sociology, demography, and public health. The series serves a mix of students and researchers in the social sciences and statistics.

Series Editors:
R. Michael Alvarez, California Institute of Technology
Nathaniel L. Beck, New York University
Lawrence L. Wu, New York University

Other Titles in the Series:
Event History Modeling: A Guide for Social Scientists, by Janet M. Box-Steffensmeier and Bradford S. Jones
Ecological Inference: New Methodological Strategies, edited by Gary King, Ori Rosen, and Martin A. Tanner
Spatial Models of Parliamentary Voting, by Keith T. Poole

Essential Mathematics for Political and Social Research
Jeff Gill
University of California, Davis

Two Ideal Points in the Senate; A Relation That Is Not a Function; Relating x and f(x); The Cube Law; Parallel and Perpendicular Lines; Poverty and Reading Test Scores; Nautilus Chambers; Radian Measurement of Angles; Polar Coordinates; A General Trigonometric Setup; Basic Trigonometric Function in Radians; Views of Conflict Escalation; Characteristics of a Parabola; Parabolic Presidential Popularity; Characteristics of an Ellipse; Multidimensional Issue Preference; Characteristics of a Hyperbola; Exponential Curves; Derived Hyperbola Form

Explaining Why People Vote; Graphing Ideal Points in the Senate; The “Cube Rule” in Votes to Seats; Child Poverty and Reading Scores; Coalition Cabinet Formation; An Expected Utility Model of Conflict Escalation; Testing for a Circular Form; Presidential Support as a Parabola; Elliptical Voting Preferences; Hyperbolic Discounting in Evolutionary Psychology and Behavioral Economics

Craps in a Casino; Contraception Use in Barbados; Campaign Contributions; Shuffling Cards; Conflict and Cooperation in Rural Andean Communities; Population Migration Within Malawi

Preface

This book is intended to serve several needs. First (and perhaps foremost), it is supposed to be an introduction to mathematical principles for incoming social science graduate students. For this reason, there is a large set of examples (83 of them, at last count) drawn from various literatures including sociology, political science, anthropology, psychology, public policy, communications, and geography. Almost no example is produced from “hypothetical data.” This approach is intended not only to motivate specific mathematical principles and practices, but also to introduce the way that social science researchers use these tools. With this approach the topics presumably retain needed relevance. The design of the book is such that this endeavor can be a semester-long adjunct to another topic like data analysis or it can support the prefresher “math-camp” approach that is becoming increasingly popular. Second, this book can also serve as a single reference work where multiple books would ordinarily be needed. To support this function there is extensive indexing and referencing designed to make it easy to find topics. Also in support of this purpose, some topics that may not be suitable for course work are deliberately included (e.g., calculus on trigonometric functions and advanced linear algebra topics). Third, the format is purposely made conducive to self-study. Many of us missed or have forgotten various mathematical topics and then find we need them later.

The main purpose of the proposed work is to address an educational deficiency in the social and behavioral sciences. The undergraduate curriculum in the social sciences tends to underemphasize mathematical topics that are then required by students at the graduate level. This leads to some discomfort whereby incoming graduate students are often unprepared for and uncomfortable with the necessary mathematical level of research in these fields. As a result, the methodological training of graduate students increasingly often begins with intense “prequel” seminars wherein basic mathematical principles are taught in short (generally week-long) programs just before beginning the regular first-year course work. Usually these courses are taught from the instructor’s notes, selected chapters from textbooks, or assembled sets of monographs or books. There is currently no tailored book-length work that specifically addresses the mathematical topics of these programs. This work fills this need by providing a comprehensive introduction to the mathematical principles needed by modern research social scientists. The material introduces basic mathematical principles necessary to do analytical work in the social sciences, starting from first principles, but without unnecessary complexity. The core purpose is to present fundamental notions in standard notation and standard language with a clear, unified framework throughout.

Although there is an extensive literature on mathematical and statistical methods in the social sciences, there is also a dearth of introduction to the underlying language used in these works, exacerbating the fact that many students in social science graduate programs enter with an undergraduate education that contains no regularized exposure to the mathematics they will need to succeed in graduate school.
Actually, the book is itself a prerequisite, so for obvious reasons the prerequisites to this prerequisite are minimal. The only required material is knowledge of high school algebra and geometry. Most target students will have had very little

mathematical training beyond this modest level. Furthermore, the ﬁrst chapter is sufﬁciently basic that readers who are comfortable with only arithmetic operations on symbolic quantities will be able to work through the material. No prior knowledge of statistics, probability, or nonscalar representations will be required. The intended emphasis is on a conceptual understanding of key principles and in their subsequent application through examples and exercises. No proofs or detailed derivations will be provided. The book has two general divisions reﬂecting a core component along with associated topics. The ﬁrst section comprises six chapters and is focused on basic mathematical tools, matrix algebra, and calculus. The topics are all essential, deterministic, mathematical principles. The primary goal of this section is to establish the mathematical language used in formal theory and mathematical analysis, as practiced in the social sciences. The second section, consisting of three chapters, is designed to give the background required to proceed in standard empirical quantitative analysis courses such as social science statistics and mathematical analysis for formal theory. Although structure differs somewhat by chapter, there is a general format followed within each. There is motivation given for the material, followed by a detailed exposition of the concepts. The concepts are illustrated with examples that social scientists care about and can relate to. This last point is not trivial. A great many books in these areas center on engineering and biology examples, and the result is often reduced reader interest and perceived applicability in the social sciences. Therefore, every example is taken from the social and behavioral sciences. Finally, each chapter has a set of exercises designed to reinforce the primary concepts. There are different ways to teach from this book. 
The obvious way is to cover the ﬁrst six chapters sequentially, although aspects of the ﬁrst two chapters may be skipped for a suitably prepared audience. Chapter 2 focuses on trigonometry, and this may not be needed for some programs. The topics in Chapters 7, 8, and 9 essentially constitute a “pre-statistics” course for social science graduate

students. This may or may not be useful for speciﬁc purposes. The last chapter on Markov chains addresses a topic that has become increasingly important. This tool is used extensively in both mathematical modeling and Bayesian statistics. In addition, this chapter is a useful way to practice and reinforce the matrix algebra principles covered in Chapters 3 and 4. This book can also be used in a “just in time” way whereby a course on mathematical modeling or statistics proceeds until certain topics in matrix algebra, calculus, or random variables are needed. As noted, one intended use of this book is through a “math-camp” approach where incoming graduate students are given a pre-semester intensive introduction to the mathematical methods required for their forthcoming study. This is pretty standard in economics and is increasingly done in political science, sociology, and other ﬁelds. For this purpose, I recommend one of two possible abbreviated tracks through the material:

the latter two chapters. Conversely, a lighter pre-statistics approach that does not need to focus on theory involving calculus might look like the following:

Standard Pre-Statistics Program
• Chapter 1: The Basics.
• Chapter 3: Linear Algebra: Vectors, Matrices, and Operations.
• Chapter 7: Probability Theory.
• Chapter 8: Random Variables.

This program omits Chapter 5 from the previous listing but sets students up for such standard regression texts as Hanushek and Jackson (1977), Gujarati (1995), Neter et al. (1996), Fox (1997), or the forthcoming text in this series by Schneider and Jacoby. For an even “lighter” version of this program, parts of Chapter 3 could be omitted.

Each chapter is accompanied by a set of exercises. Some of these are purely mechanical and some are drawn from various social science publications. The latter are designed to provide practice and also to show the relevance of the pertinent material. Instructors will certainly want to tailor assignments rather than require the bulk of these problems. In addition, there is an instructor’s manual containing answers to the exercises available from Cambridge University Press.

It is a cliche to say, but this book was not created in a vacuum and numerous people read, perused, commented on, criticized, railed at, and even taught from the manuscript. These include Attic Access, Mike Alvarez, Maggie Bakhos, Ryan Bakker, Neal Beck, Scott Desposato, James Fowler, Jason Gainous, Scott Gartner, Hank Heitowit, Bob Huckfeldt, Bob Jackman, Marion Jagodka, Renee Johnson, Cindy Kam, Paul Kellstedt, Gary King, Jane Li, Michael Martinez, Ryan T. Moore, Will Moore, Elise Oranges, Bill Reed, Marc Rosenblum, Johny Sebastian, Will Terry, Les Thiele, Shawn Treier, Kevin Wagner, Mike Ward, and Guy Whitten. I apologize to anyone inadvertently left off this list. In particular, I thank Ed Parsons for his continued assistance and patience in helping get this project done.
I have also enjoyed the continued support of various chairs,

deans, and colleagues at the University of California–Davis, the ICPSR Summer Program at the University of Michigan, and at Harvard University. Any errors that may remain, despite this outstanding support network, are entirely the fault of the author. Please feel free to contact me with comments, complaints, omissions, general errata, or even praise (jgill@ucdavis.edu).

1
The Basics

1.1 Objectives

This chapter gives a very basic introduction to practical mathematical and arithmetic principles. Some readers who can recall their earlier training in high school and elsewhere may want to skip it or merely skim over the vocabulary. However, many often find that the various other interests in life push out the assorted artifacts of functional expressions, logarithms, and other principles. Usually what happens is that we vaguely remember the basic ideas without specific properties, in the same way that we might remember that the assigned reading of Steinbeck’s Grapes of Wrath included poor people traveling West without remembering all of the unfortunate details. To use mathematics effectively in the social sciences, however, it is necessary to have a thorough command over the basic mathematical principles in this chapter.

Why is mathematics important to social scientists? There are two basic reasons, where one is more philosophical than the other. A pragmatic reason is that it simply allows us to communicate with each other in an orderly and systematic way; that is, ideas expressed mathematically can be more carefully defined and more directly communicated than with narrative language, which is more susceptible to vagueness and misinterpretation. The causes of these effects include multiple interpretations of words and phrases, near-synonyms, cultural effects, and even poor writing quality.

The second reason is less obvious, and perhaps more debatable in social science disciplines. Plato said “God ever geometrizes” (by extension, the nineteenth-century German mathematician Carl Jacobi said “God ever arithmetizes”). The meaning is something that humans have appreciated since before the building of the pyramids: Mathematics is obviously an effective way to describe our world. What Plato and others noted was that there is no other way to formally organize the phenomena around us. Furthermore, awesome physical forces such as the movements of planets and the workings of atoms behave in ways that are described in rudimentary mathematical notation.

What about social behavior? Such phenomena are equally easy to observe but apparently more difficult to describe in simple mathematical terms. Substantial progress dates back only to the 1870s, starting with economics, and followed closely by psychology. Obviously something makes this more of a challenge. Fortunately, some aspects of human behavior have been found to obey simple mathematical laws: Violence increases in warmer weather, overt competition for hierarchical place increases with group size, increased education reduces support for the death penalty, and so on. These are not immutable, constant forces; rather, they reflect underlying phenomena that social scientists have found and subsequently described in simple mathematical form.

1.2 Essential Arithmetic Principles

We often use arithmetic principles on a daily basis without considering that they are based on a formalized set of rules. Even though these rules are elementary, it is worth stating them here. For starters, it is easy to recall that negative numbers are simply positive numbers multiplied by −1, that fractions represent ratios, and that multiplication can be represented in several ways (a × b = (a)(b) = a · b = a ∗ b). Other rules are more elusive but no less important. For instance, the order of operations gives a unique answer for expressions that have multiple arithmetic actions. The order is (1) perform operations on individual values first, (2) evaluate parenthetical operations next, (3) do multiplications and divisions in order from left to right and, finally, (4) do additions and subtractions from left to right. So we would solve the following problem in the specified order:

$$
\begin{aligned}
2^3 + 2 \times (2 \times 5 - 4)^2 - 30 &= 8 + 2 \times (2 \times 5 - 4)^2 - 30 \\
&= 8 + 2 \times (10 - 4)^2 - 30 \\
&= 8 + 2 \times (6)^2 - 30 \\
&= 8 + 2 \times 36 - 30 \\
&= 8 + 72 - 30 \\
&= 50.
\end{aligned}
$$

In the first step there is only one “atomic” value to worry about, so we take 2 to the third power first. Because there are no more of these, we proceed to evaluating the operations in parentheses using the same rules. Thus 2 × 5 − 4 becomes 6 before it is squared. There is one more multiplication to worry about followed by adding and subtracting from left to right. Note that we would have gotten a different answer if we had not followed these rules. This is important as there can be only one mathematically correct answer to such questions. Also, when parentheses are nested, then the order (as implied above) is to start in the innermost expression and work outward. For instance, (((2 + 3) × 4) + 5) = (((5) × 4) + 5) = ((20) + 5) = 25.

Zero represents a special number in mathematics. Multiplying by zero produces zero and adding zero to some value leaves it unchanged. Generally the only thing to worry about with zero is that dividing any number by zero (x/0 for any x) is undefined. Interestingly, this is true for x = 0 as well. The number 1 is another special number in mathematics and the history of mathematics, but it has no associated troublesome characteristic.

Some basic functions and expressions will be used liberally in the text without further explanation. Fractions can be denoted x/y or $\frac{x}{y}$. The absolute value of a number is the positive representation of that number. Thus |x| = x if x is positive and |x| is −x if x is negative. The square root of a number is a radical of order two: $\sqrt{x} = \sqrt[2]{x} = x^{\frac{1}{2}}$, and more generally the principal root is $\sqrt[r]{x} = x^{\frac{1}{r}}$ for numbers x and r. In this general case x is called the radicand and r is called the index. For example, $\sqrt[3]{8} = 8^{\frac{1}{3}} = 2$ because $2^3 = 8$.
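The rules in this section can be checked mechanically. The following is a minimal Python sketch (Python is our choice for illustration and is not used elsewhere in the text) that evaluates the worked order-of-operations expression, a deliberately misgrouped version of it, the nested-parentheses example, and the absolute value and root identities. All variable names are hypothetical, and the cube root is computed in floating point, so it should be read as approximately 2 rather than exactly 2.

```python
import math

# Order of operations: exponent on the single value, then parentheses,
# then multiplication, then addition and subtraction left to right.
worked = 2**3 + 2 * (2 * 5 - 4) ** 2 - 30
print(worked)  # 50

# Grouping the same symbols strictly left to right breaks the rules
# and gives a different (incorrect) answer.
left_to_right = ((2**3 + 2) * (2 * 5 - 4)) ** 2 - 30
print(left_to_right)  # 3570

# Nested parentheses are evaluated from the innermost pair outward.
nested = (((2 + 3) * 4) + 5)
print(nested)  # 25

# Division by zero is undefined, including 0/0; Python signals this with an error.
undefined_cases = []
for x in (1, 0):
    try:
        x / 0
    except ZeroDivisionError:
        undefined_cases.append(x)
print(undefined_cases)  # [1, 0]

# Absolute value, and roots written as fractional exponents.
print(abs(-7), abs(7))  # 7 7
square_root = 16 ** (1 / 2)
cube_root = 8 ** (1 / 3)
print(math.isclose(square_root, math.sqrt(16)))  # True
print(math.isclose(cube_root, 2.0))              # True
```

The misgrouped line is only one of many possible ways to ignore the rules; it is included to show that the answer really does depend on following them.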

1.3 Notation, Notation, Notation

One of the most daunting tasks for the beginning social scientist is to make sense of the language of their discipline. This has two general dimensions: (1) the substantive language in terms of theory, field knowledge, and socialized terms; and (2) the formal means by which these ideas are conveyed. In a great many social science disciplines and subdisciplines the latter is the notation of mathematics. By notation we do not mean the use of specific terms per se (see Section 1.4 for that discussion); instead we mean the broad use of symbology to represent values or levels of phenomena, interrelations among these, and a logical, consistent manipulation of this symbology.

Why would we use mathematics to express ideas about ideas in anthropology, political science, public policy, sociology, psychology, and related disciplines? Precisely because mathematics lets us exactly convey asserted relationships between quantities of interest. The key word in that last sentence is exactly: We want some way to be precise in claims about how some social phenomenon affects another social phenomenon. Thus the purchase of mathematical rigor provides a careful and exacting way to analyze and discuss the things we actually care about.

Example 1.1: Explaining Why People Vote. This is a simple example from voting theory. Anthony Downs (1957) claimed that a rational voter (supposedly someone who values her time and resources) would weigh the cost of voting against the gains received from voting. These rewards are asserted to be the value from a preferred candidate winning the election times the probability that her vote will make an actual difference in the election. It is common to “measure” the difference between cost and reward as the utility that the person receives from the act. “Utility” is a word borrowed from economists that simply specifies an underlying preference scale that we usually cannot directly see. This is generally not complicated: I will get greater utility from winning the state lottery than I will from winning the office football pool, or I will get greater utility from spending time with my friends than I will from mowing the lawn. Now we should make this idea more “mathematical” by specifying a relevant relationship. Riker and Ordeshook (1968) codified the Downsian model into mathematical symbology by articulating the following variables for an individual voter given a choice between two candidates:

R = the utility satisfaction of voting
P = the actual probability that the voter will affect the outcome with her particular vote
B = the perceived difference in benefits between the two candidates measured in utiles (units of utility): $B_1 - B_2$
C = the actual cost of voting in utiles (i.e., time, effort, money).

Thus the Downsian model is represented as R = PB − C. This is an unbelievably simple yet powerful model of political participation. In fact, we can use this statement to make claims that would not be as clear or as precise if described in descriptive language alone. For instance, consider these statements:

• The voter will abstain if R < 0.
• The voter may still not vote even if R > 0 if there exist other competing activities that produce a higher R.
• If P is very small (i.e., it is a large election with many voters), then it is unlikely that this individual will vote.

The last statement leads to what is called the paradox of participation: If nobody’s vote is decisive, then why would anyone vote? Yet we can see that many people actually do show up and vote in large general elections. This paradox demonstrates that there is more going on than our simple model above.

The key point from the example above is that the formalism such mathematical representation provides gives us a way to say more exact things about social phenomena. Thus the motivation for introducing mathematics into the study of the social and behavioral sciences is to aid our understanding and improve the way we communicate substantive ideas.
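To make the bulleted claims concrete, here is a minimal Python sketch of the relationship R = PB − C from Example 1.1. The function names and every numerical value below are hypothetical and chosen only for illustration; none of them come from Downs (1957) or Riker and Ordeshook (1968).

```python
def voting_utility(p, b, c):
    """Return R = P * B - C, the net utility of voting."""
    return p * b - c


def abstains(r):
    """First bulleted claim: the voter abstains when R < 0."""
    return r < 0


# Hypothetical inputs: the same benefit difference B and cost C (in utiles),
# but very different probabilities P of casting the decisive vote.
benefit, cost = 50.0, 0.25
small_election = voting_utility(p=0.01, b=benefit, c=cost)
large_election = voting_utility(p=1e-7, b=benefit, c=cost)

print(abstains(small_election))  # False: P * B exceeds the cost
print(abstains(large_election))  # True: P is so small that R is negative
```

Holding B and C fixed and shrinking P is exactly the setting of the third bullet, and it reproduces the prediction behind the paradox of participation.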

1.4 Basic Terms

Some terms are used ubiquitously in social science work. A variable is just a symbol that represents a single number or group of numbers. Often variables are used as a substitution for numbers that we do not know or numbers that we will soon observe from some political or social phenomenon. Most frequently these are quantities like X, Y, a, b, and so on. Oddly enough, the modern notion of a variable was not codified until the early nineteenth century by the German mathematician Lejeune Dirichlet. We also routinely talk about data: collections of observed phenomena. Note that data is plural; a single point is called a datum or a data point.

There are some other conventions from mathematics and statistics (as well as some other fields) that are commonly used in social science research as well. Some of these are quite basic, and social scientists speak this technical language fluently. Unless otherwise stated, variables are assumed to be defined on the Cartesian coordinate system.† If we are working with two variables x and y, then there is an assumed perpendicular set of axes where the x-axis (always given horizontally) is crossed with the y-axis (usually given vertically), such that the number pair (x, y) defines a point on the two-dimensional graph. There is actually no restriction to just two dimensions; for instance a point in 3-space is typically notated (x, y, z).

Example 1.2: Graphing Ideal Points in the Senate. One very active area of empirical research in political science is the estimation and subsequent use of legislative ideal points [see Jackman (2001), Londregan (2000), Poole and Rosenthal (1985, 1997)]. The objective is to analyze a member’s voting record with the idea that this member’s ideal policy position in policy space can be estimated. This gets really interesting when the entire chamber (House, Senate, Parliament) is estimated accordingly, and various voting outcomes are analyzed or predicted. Figure 1.1 shows approximate ideal points for Ted Kennedy and Orrin Hatch on two proposed projects (it is common to propose Hatch as the foil for Kennedy). Senator Hatch is assumed to have an ideal point in this two-dimensional space at x = 5, y = 72, and Ted Kennedy is assumed to have an ideal point at x = 89, y = 17. These values are obtained from interest group rankings provided by the League of Conservation Voters (2003) and the National Taxpayers Union (2003). We can also estimate the ideal points of other Senators in this way: One would guess that Trent Lott would be closer to Hatch than Kennedy, for instance.

† Alternatives exist such as “spherical space,” where lines are deﬁned on a generalization of circular space so they cannot be parallel to each other and must return to their point of origin, as well as Lobachevskian geometry and Kleinian geometry. These and other related systems are not generally useful in the social sciences and will therefore not be considered here with the exception of general polar coordinates in Chapter 2.

Fig. 1.1. Two Ideal Points in the Senate
[Figure: the ideal points of Orrin Hatch and Ted Kennedy plotted with Park Lands on the horizontal axis and Tax Cuts on the vertical axis, each running from 0 to 100.]

Now consider a hypothetical trade-off between two bills competing for limited federal resources. These are appropriations (funding) for new national park lands, and a tax cut (i.e., national resources protection and development versus reducing taxes and thus taking in less revenue for the federal government). If there is a greater range of possible compromises, then other in-between points are possible. The best way to describe the possible space of solutions here is on a two-dimensional Cartesian coordinate system. Each Senator is assumed to have an ideal spending level for the two projects that trades off spending in one dimension against another: the level he or she would pick if they controlled the Senate completely. By convention we bound this in the two dimensions from 0 to 100.

The point of Figure 1.1 is to show how useful the Cartesian coordinate system is at describing positions along political and social variables. It might be more crowded, but it would not be more complicated to map the entire Senate along these two dimensions. In cases where more dimensions are considered, the graphical challenges become greater. There are two choices: show a subset on a two- or three-dimensional plot, or draw combinations of dimensions in a two-dimensional format by pairing two at a time.

Actually, in this Senate example, the use of the Cartesian coordinate system has been made quite restrictive for ease of analysis. In the more typical, and more general, setting both the x-axis and the y-axis span negative infinity to positive infinity (although we obviously cannot draw them that way), and the space is labeled $\mathbb{R}^2$ to denote the crossing of two real lines. The real line is the line from minus infinity to positive infinity that contains the real numbers: numbers that are expressible in fractional form (2/5, 1/3, etc.) as well as those that are not because they have nonrepeating and infinitely continuing decimal values. There is therefore an infinite quantity of real numbers for any interval on the real line because numbers like $\sqrt{2}$ exist without “finishing” or repeating patterns in their list of values to the right of the decimal point ($\sqrt{2} = 1.41421356237309504880168872420969807856967187537694807317\ldots$).

It is also common to define various sets along the real line. These sets can be convex or nonconvex. A convex set has the property that for any two members of the set (numbers) $x_1$ and $x_2$, the number $x_3 = \delta x_1 + (1 - \delta) x_2$ (for $0 \le \delta \le 1$) is also in the set. For example, if $\delta = \frac{1}{2}$, then $x_3$ is the average (the mean, see below) of $x_1$ and $x_2$.
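The convex combination in this definition is easy to compute directly. The sketch below is a minimal Python illustration, assuming we take the two x-coordinates from the Senate example (5 and 89) as the members x1 and x2; the function name `convex_combination` is hypothetical.

```python
def convex_combination(x1, x2, delta):
    """Return x3 = delta * x1 + (1 - delta) * x2 for 0 <= delta <= 1."""
    if not 0 <= delta <= 1:
        raise ValueError("delta must lie between 0 and 1 inclusive")
    return delta * x1 + (1 - delta) * x2


x1, x2 = 5, 89  # x-coordinates of the two ideal points in Example 1.2

print(convex_combination(x1, x2, 0.5))  # 47.0, the mean of 5 and 89
print(convex_combination(x1, x2, 1))    # 5, all weight on x1
print(convex_combination(x1, x2, 0))    # 89, all weight on x2

# Every admissible delta yields a point that stays between x1 and x2.
points = [convex_combination(x1, x2, k / 10) for k in range(11)]
print(all(x1 <= p <= x2 for p in points))  # True
```

The final check is the convexity property in miniature: none of the weighted combinations escapes the stretch of the real line between the two original members.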
In the example above we would say that Senators are constrained to express their preferences in the interval [0:100], which is commonly used as a measure of ideology or policy preference by interest groups that rate elected officials [such as the Americans for Democratic Action (ADA), and the American Conservative Union (ACU)]. Interval notation is used frequently in mathematical notation, and there is only one important distinction: Interval ends can be “open” or “closed.” An open interval excludes the end-point denoted with parenthetical forms “(” and “)” whereas the closed interval denoted with bracket forms “[” and “]” includes it (the curved forms “{” and “}” are usually reserved for set notation). So, in altering our Senate example, we have the following one-dimensional options for x (also for y):

open on both ends:         (0:100),  0 < x < 100
closed on both ends:       [0:100],  0 ≤ x ≤ 100
closed left, open right:   [0:100),  0 ≤ x < 100
open left, closed right:   (0:100],  0 < x ≤ 100

Thus the restrictions on δ above are that it must lie in [0:1]. These intervals can also be expressed in comma notation instead of colon notation: [0, 100].
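The only difference among the four interval types is whether each end-point counts as a member. A minimal Python sketch of that distinction follows; the function `in_interval` and its keyword arguments are hypothetical names introduced here for illustration.

```python
def in_interval(x, low, high, closed_left=True, closed_right=True):
    """Report whether x lies in the interval from low to high.

    A closed end includes its end-point; an open end excludes it.
    """
    left_ok = x >= low if closed_left else x > low
    right_ok = x <= high if closed_right else x < high
    return left_ok and right_ok


# The four one-dimensional options for the Senate example, tested at x = 0.
print(in_interval(0, 0, 100, closed_left=False, closed_right=False))  # False: (0:100)
print(in_interval(0, 0, 100))                                         # True:  [0:100]
print(in_interval(0, 0, 100, closed_right=False))                     # True:  [0:100)
print(in_interval(0, 0, 100, closed_left=False))                      # False: (0:100]

# Interior points belong to all four versions.
print(in_interval(72, 0, 100, closed_left=False, closed_right=False))  # True
```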

1.4.1 Indexing and Referencing

Another common notation is the technique of indexing observations on some variable by the use of subscripts. If we are going to list some value like years served in the House of Representatives (as of 2004), we would not want to use some cumbersome notation like

Abercrombie = 14
Acevedo-Vila = 14
Ackerman = 21
Aderholt = 8
. . .
Wu = 6
Wynn = 12
Young = 34
Young = 32

which would lead to awkward statements like “Abercrombie’s years in ofﬁce” + “Acevedo-Vila’s years in ofﬁce”. . . + “Young’s years in ofﬁce” to express

mathematical manipulation (note also the obvious naming problem here as well, i.e., delineating between Representative Young of Florida and Representative Young of Alaska). Instead we could just assign each member ordered alphabetically to an integer 1 through 435 (the number of U.S. House members) and then index them by subscript: $X = \{X_1, X_2, X_3, \ldots, X_{433}, X_{434}, X_{435}\}$. This is a lot cleaner and more mathematically useful. For instance, if we wanted to calculate the mean (average) time served, we could simply perform:

$$\bar{X} = \frac{1}{435}\left(X_1 + X_2 + X_3 + \cdots + X_{433} + X_{434} + X_{435}\right)$$

(the bar over X denotes that this average is a mean, something we will see frequently). Although this is cleaner and easier than spelling names or something like that, there is an even nicer way of indicating a mean calculation that uses the summation operator. This is a large version of the Greek letter sigma where the starting and stopping points of the addition process are spelled out over and under the symbol. So the mean House seniority calculation could be specified simply by

$$\bar{X} = \frac{1}{435}\sum_{i=1}^{435} X_i,$$

where we say that i indexes X in the summation. One way to think of this notation is that $\sum$ is just an adding “machine” that instructs us which X to start with and which one to stop with. In fact, if we set n = 435, then this becomes the simple (and common) form

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$

More formally,

The Summation Operator
If $X_1, X_2, \ldots, X_n$ are n numerical values, then their sum can be represented by $\sum_{i=1}^{n} X_i$, where i is an indexing variable to indicate the starting and stopping points in the series $X_1, X_2, \ldots, X_n$.

A related notation is the product operator. This is a slightly different “machine” denoted by an uppercase Greek pi that tells us to multiply instead of add as we did above:

\[ \prod_{i=1}^{n} X_i \]

(i.e., it multiplies the n values together). Here we also use i again as the index, but it is important to note that there is nothing special about the use of i; it is just a very common choice. Frequent index alternatives include j, k, l, and m. As a simple illustration, suppose $p_1 = 0.2$, $p_2 = 0.7$, $p_3 = 0.99$, $p_4 = 0.99$, $p_5 = 0.99$. Then

\[ \prod_{j=1}^{5} p_j = p_1 \cdot p_2 \cdot p_3 \cdot p_4 \cdot p_5 = (0.2)(0.7)(0.99)(0.99)(0.99) = 0.1358419. \]
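The product calculation above can be reproduced with a running multiplication. This is a minimal sketch, assuming Python 3.8 or later for `math.prod`; the list name `p` simply mirrors the notation in the text.

```python
import math

p = [0.2, 0.7, 0.99, 0.99, 0.99]

# Running product mirroring the product operator: start at 1 and multiply.
product = 1.0
for p_j in p:
    product *= p_j

print(round(product, 7))                     # rounded to the precision used in the text
print(math.isclose(product, math.prod(p)))   # cross-check against the library routine
```

Note that the starting value is 1 rather than 0, since 1 is the value that leaves a product unchanged.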

Similarly, the formal deﬁnition for this operator is given by


The Product Operator
If $X_1, X_2, \ldots, X_n$ are n numerical values, then their product can be represented by

\[ \prod_{i=1}^{n} X_i, \]

where i is an indexing variable to indicate the starting and stopping points in the series $X_1, X_2, \ldots, X_n$.

Subscripts are used because we can immediately see that they are not a mathematical operation on the symbol being modified. Sometimes it is also convenient to index using a superscript. To distinguish between a superscript as an index and an exponent operation, brackets or parentheses are often used. So $X^2$ is the square of X, but $X^{[2]}$ and $X^{(2)}$ are indexed values. There is another, sometimes confusing, convention that comes from six decades of computer notation in the social sciences and other fields. Some authors will index values without the subscript, as in X1, X2, . . ., or differing functions (see Section 1.5 for the definition of a function) without subscripting according to f1, f2, . . . . Usually it is clear what is meant, however.

1.4.2 Specific Mathematical Use of Terms

The use of mathematical terms can intimidate readers even when the author does not mean to do so. This is because many of them are based on the Greek alphabet or strange versions of familiar symbols (e.g., ∀ versus A). This does not mean that the use of these symbols should be avoided for readability. Quite the opposite; for those familiar with the basic vocabulary of mathematics such symbols provide a more concise and readable story if they can clearly summarize ideas that would be more elaborate in narrative. We will save the complete list of Greek idioms to the appendix and give others here, some of which are critical in forthcoming chapters and some of which are given for completeness. Some terms are almost universal in their usage and thus are important to recall without hesitation. Certain probability and statistical terms will be given as needed in later chapters. An important group of standard symbols are those that define the set of numbers in use. These are

Symbol                  Explanation
$R$                     the set of real numbers
$R^{+}$                 the set of positive real numbers
$R^{-}$                 the set of negative real numbers
$I$                     the set of integers
$I^{+}$ or $Z^{+}$      the set of positive integers
$I^{-}$ or $Z^{-}$      the set of negative integers
$Q$                     the set of rational numbers
$Q^{+}$                 the set of positive rational numbers
$Q^{-}$                 the set of negative rational numbers
$C$                     the set of complex numbers (those based on $\sqrt{-1}$).

Recall that the real numbers take on an infinite number of values: rational (expressible in fraction form) and irrational (not expressible in fraction form with values to the right of the decimal point, nonrepeating, like pi). It is interesting to note that there are an infinite number of irrationals and every irrational falls between two rational numbers. For example, $\sqrt{2}$ is in between 7/5 and 3/2. Integers are positive and negative (rational) numbers with no decimal component and sometimes called the “counting numbers.” Whole numbers are positive integers along with zero, and natural numbers are positive integers without zero. We will not generally consider here the set of complex numbers, but they are those that include the imaginary number: $i = \sqrt{-1}$, as in $\sqrt{-4} = 2\sqrt{-1} = 2i$. In mathematical and statistical modeling it is often important to remember which of these number types above is being considered.

Some terms are general enough that they are frequently used with statements about sets or with standard numerical declarations. Other forms are more obscure but do appear in certain social science literatures. Some reasonably common examples are listed in the next table. Note that all of these are contextual, that is, they lack any meaning outside of sentence-like statements with other symbols.

Symbol    Explanation
¬         logical negation statement
∈         is an element of, as in $3 \in I^{+}$
∋         such that
∴         therefore
∵         because
=⇒        logical “then” statement
⇐⇒        if and only if, also abbreviated “iff”
∃         there exists
∀         for all
≬         between
∥         parallel
∠         angle

Also, many of these symbols can be negated, and negation is expressed in one of two ways. For instance, ∈ means “is an element of,” but both ∉ and ¬∈ mean “is not an element of.” Similarly, ⊂ means “is a subset of,” but ⊄ means “is not a subset of.” Some of these terms are used in a very linguistic fashion: $3 - 4 \in R^{-}$ ∵ $3 < 4$. The “therefore” statement is usually at the end of some logic: $2 \in I^{+}$ ∴ $2 \in R^{+}$. The last three in this list are most useful in geometric expressions and indicate spatial characteristics. Here is a lengthy mathematical statement using most of these symbols:

∀x ∈ $I^{+}$ and x ¬prime, ∃y ∈ $I^{+}$ ∋ x/y ∈ $I^{+}$.

So what does this mean? Let’s parse it: “For all numbers x such that x is a positive integer and not a prime number, there exists a y that is a positive integer such that x divided by y is also a positive integer.” Easy, right? (Yeah, sure.) Can you construct one yourself?

Another “fun” example is

x ∈ $I$ and x ≠ 0 =⇒ x ∈ $I^{-}$ or $I^{+}$.

This says that if x is a nonzero integer, it is either a positive integer or a negative integer. Consider this in pieces. The first part, x ∈ $I$ and x ≠ 0, stipulates that x is “in” the group of integers and cannot be equal to zero. The right arrow, =⇒, is a logical consequence statement equivalent to saying “then.” The last part gives the result, either x is a negative integer or a positive integer (and nothing else since no alternatives are given).

Another important group of terms are related to the manipulation of sets of objects, which is an important use of mathematics in social science work (sets are simply defined groupings of individual objects; see Chapter 7, where sets and operations on sets are defined in detail). The most common are

Symbol    Explanation
∅         the empty set (sometimes used with the Greek phi: φ)
∪         union of sets
∩         intersection of sets
\         subtract from set
⊂         subset
∁         complement

These allow us to make statements about groups of objects such as A ⊂ B for A = {2, 4}, B = {2, 4, 7}, meaning that the set A is a smaller grouping of the larger set B. We could also observe that A results from removing seven from B.

Some symbols, however, are “restricted” to comparing or operating on strictly numerical values and are not therefore applied directly to sets or logic expressions. We have already seen the sum and product operators given by the symbols Σ and Π accordingly. The use of ∞ for infinity is relatively common even outside of mathematics, but the next list also gives two distinct “flavors” of infinity. Some of the contexts of these symbols we will leave to remaining chapters as they deal with notions like limits and vector quantities.

Symbol              Explanation
∝                   is proportional to
≐                   equal to in the limit (approaches)
⊥                   perpendicular
∞                   infinity
$\infty^{+}$, +∞    positive infinity
$\infty^{-}$, −∞    negative infinity
Σ                   summation
Π                   product
⌊ ⌋                 floor: round down to nearest integer
⌈ ⌉                 ceiling: round up to nearest integer
|                   given that: X|Y = 3

Related to these is a set of functions relating maximum and minimum values. Note the directions of ∨ and ∧ in the following table.

Symbol                  Explanation
∨                       maximum of two values
max()                   maximum value from list
∧                       minimum of two values
min()                   minimum value from list
$\arg\max_{x} f(x)$     the value of x that maximizes the function f(x)
$\arg\min_{x} f(x)$     the value of x that minimizes the function f(x)

The latter two are important but less common functions. Functions are formally defined in the next section, but we can just think of them for now as sets of instructions for modifying input values ($x^2$ is an example function that squares its input). As a simple example of the argmax function, consider

\[ \arg\max_{x \in R} \; x(1 - x), \]

which asks which value on the real number line maximizes x(1 − x). The answer is 0.5 which provides the best trade-off between the two parts of the function. The argmin function works accordingly but (obviously) operates on the function minimum instead of the function maximum.

These are not exhaustive lists of symbols, but they are the most fundamental (many of them are used in subsequent chapters). Some literatures develop their own conventions about symbols and their very own symbols, for example to denote a mathematical representation of a game or to indicate geometric equivalence between two objects, but such extensions are rare in the social sciences.
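The argmax example for x(1 − x) discussed above can be checked numerically with a simple grid search. This is only a sketch: the grid over [0, 1] and its resolution are assumptions made for illustration, not part of the text, and a grid search is a crude stand-in for the calculus-based methods introduced in later chapters.

```python
def f(x):
    """The function from the argmax example: x(1 - x)."""
    return x * (1 - x)

# Evenly spaced candidate values between 0 and 1 (an illustrative choice).
grid = [i / 1000 for i in range(1001)]

# max() with a key returns the grid value whose f(x) is largest,
# which is what argmax asks for; max(f(x) for x in grid) would
# instead return the largest function value.
x_best = max(grid, key=f)

print(x_best, f(x_best))
```

The distinction in the comment is the important one: argmax returns the input, while max returns the output.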

1.5 Functions and Equations

A mathematical equation is a very general idea. Fundamentally, an equation “equates” two quantities: They are arithmetically identical. So the expression R = PB − C is an equation because it establishes that R and PB − C are exactly equal to each other. But the idea of a mathematical sentence is more general (less restrictive) than this because we can substitute other relations for equality, such as

Symbol    Meaning
<         less than
≤         less than or equal to
≪         much less than
>         greater than
≥         greater than or equal to
≫         much greater than
≈         approximately the same
≅         approximately equal to
⪅         approximately less than (also ≲)
⪆         approximately greater than (also ≳)
≡         equivalent by assumption

So, for example, if we say that X = 1, Y = 1.001 and Z = 0.002, then the following statements are true:

The purpose of the equation form is generally to express more than one set of relations. Most of us remember the task of solving “two equations for two unknowns.” Such forms enable us to describe how (possibly many) variables are associated to each other and various constants. The formal language of mathematics relies heavily on the idea that equations are the atomic units of relations.

What is a function? A mathematical function is a “mapping” (i.e., specific directions), which gives a correspondence from one measure onto exactly one other for that value. That is, in our context it defines a relationship between one variable on the x-axis of a Cartesian coordinate system and an operation on that variable that can produce only one value on the y-axis. So a function is a mapping from one defined space to another, such as f : R → R, in which f maps the real numbers to the real numbers (e.g., f(x) = 2x), or f : R → I, in which f maps the real numbers to the integers (e.g., f(x) = round(x)).

This all sounds very technical, but it is not. One way of thinking about functions is that they are a “machine” for transforming values, sort of a box as in the accompanying figure. To visualize this we can think about values, x, going in and some modification of these values, f(x), coming out where the instructions for this process are contained in the “recipe” given by f().

[Figure: A Function Represented: x → f() → f(x)]

Consider the following function operating on the variable x: $f(x) = x^2 - 1$. This simply means that the mapping from x to f(x) is the process that squares x and subtracts 1. If we list a set of inputs, we can define the corresponding set of outputs, for example, the paired values listed in Table 1.1. Here we used the f() notation for a function (first codified by Euler in the eighteenth century and still the most common form used today), but other forms are only slightly less common, such as: g(), h(), p(), and u(). So we could have just as readily said: $g(x) = x^2 - 1$.

Sometimes the additional notation for a function is essential, such as when more than one function is used in the same expression. For instance, functions can be “nested” with respect to each other (called a composition): $f \circ g = f(g(x))$, as in g(x) = 10x and $f(x) = x^2$, so $f \circ g = (10x)^2$ (note that this is different than $g \circ f$, which would be $10(x^2)$). Function definitions can also contain wording instead of purely mathematical expressions and may have conditional

Note that the first example is necessarily a noncontinuous function whereas the second example is a continuous function (but perhaps not obviously so). Recall that π is notation for 3.1415926535. . . , which is often given inaccurately as just 3.14 or even 22/7. To be more specific about such function characteristics, we now give two important properties of a function.

Properties of Functions, Given for g(x) = y
A function is continuous if it has no “gaps” in its mapping from x to y.
A function is invertible if its reverse operation exists: $g^{-1}(y) = x$, where $g^{-1}(g(x)) = x$.

It is important to distinguish between a function and a relation. A function must have exactly one value returned by f(x) for each value of x, whereas a relation does not have this restriction. One way to test whether f(x) is a function or, more generally, a relation is to graph it in the Cartesian coordinate system (x versus y in orthogonal representation) and see if there is a vertical line that can be drawn such that it intersects the function at two values (or more) of y for a single value of x. If this occurs, then it is not a function. There is an important distinction to be made here. The solution to a function can possibly have more than one corresponding value of x, but a function cannot have alternate values of y for a given x. For example, consider the relation $y^2 = 5x$, which is not a function based on this criterion. We can see this algebraically by taking the square root of both sides, $\pm y = \sqrt{5x}$, which shows the non-uniqueness of the y values (as well as the restriction to positive values of x). We can also see this graphically in Figure 1.2, where x values from 0 to 10 each give two y values (a dotted line is given at $(x = 4, y = \pm\sqrt{20})$ as an example).
Fig. 1.2. A Relation That Is Not a Function [plot of $y^2 = 5x$, with x from 0 to 10 on the horizontal axis and y from −6 to 6 on the vertical axis]
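The vertical-line idea for the relation y² = 5x can also be expressed computationally: for a chosen x, collect every y that satisfies the relation and see whether there is more than one. The helper name `y_values` below is hypothetical and the sketch handles only this one relation.

```python
import math

def y_values(x):
    """Return every real y satisfying y**2 == 5*x (illustrative helper)."""
    if x < 0:
        return []             # no real solution for negative x
    if x == 0:
        return [0.0]          # the two roots coincide at the origin
    root = math.sqrt(5 * x)
    return [-root, root]      # two distinct y values for one x

ys = y_values(4)
print(ys)                     # the pair at x = 4, i.e. minus and plus sqrt(20)
print(len(ys) > 1)            # more than one y for a single x: not a function
```

A mapping that qualified as a function would return at most one value for every x in its domain.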

The modern definition of a function is also attributable to Dirichlet: If variables x and y are related such that every acceptable value of x has a corresponding value of y defined by a rule, then y is a function of x. Earlier European period notions of a function (i.e., by Leibniz, Bernoulli, and Euler) were more vague and sometimes tailored only to specific settings.

Fig. 1.3. Relating x and f(x) [two panels plotting $f(x) = x^2 - 1$: “x unbounded” and “x bounded by 0 and 6”]

Often a function is explicitly defined as a mapping between elements of an ordered pair: (x, y), also called a relation. So we say that the function f(x) = y maps the ordered pair x, y such that for each value of x there is exactly one y (the order of x before y matters). This was exactly what we saw in Table 1.1, except that we did not label the rows as ordered pairs. As a more concrete example, the following set of ordered pairs: {[1, −2], [3, 6], [7, 46]} can be mapped by the function: $f(x) = x^2 - 3$. If the set of x values is restricted to some specifically defined set, then obviously so is y. The set of x values is called the domain (or support) of the function and the associated set of y values is called the range of the function. Sometimes this is highly restrictive (such as to specific integers) and sometimes it is not. Two examples are given in Figure 1.3, which is drawn on the (now) familiar Cartesian coordinate system. Here we see that the range and domain of the function are unbounded in the first panel (although we clearly cannot draw it all the way until infinity in both directions), and the domain is bounded by 0 and 6 in the second panel.

A function can also be even or odd, defined by

a function is “odd” if: f(−x) = −f(x)
a function is “even” if: f(−x) = f(x).
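The two definitions just given can be spot-checked numerically. The sketch below tests the even and odd conditions at a handful of sample points; the helper names, the sample points, and the tolerance are all illustrative assumptions, and passing at a few points is evidence rather than proof.

```python
def is_even(f, xs, tol=1e-12):
    """True if f(-x) equals f(x) at every sample point."""
    return all(abs(f(-x) - f(x)) <= tol for x in xs)

def is_odd(f, xs, tol=1e-12):
    """True if f(-x) equals -f(x) at every sample point."""
    return all(abs(f(-x) + f(x)) <= tol for x in xs)

xs = [0.5, 1.0, 2.0, 3.5]   # arbitrary sample points

print(is_even(lambda x: x**2, xs), is_odd(lambda x: x**2, xs))   # squaring
print(is_even(lambda x: x**3, xs), is_odd(lambda x: x**3, xs))   # cubing
print(is_even(abs, xs), is_odd(abs, xs))                         # absolute value
# A function can fail both tests:
print(is_even(lambda x: x**2 - x, xs), is_odd(lambda x: x**2 - x, xs))
```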

So, for example, the squaring function $f(x) = x^2$ and the absolute value function f(x) = |x| are even because both discard the sign of the input, so that x and −x give the same answer. On the other hand, $f(x) = x^3$ is odd because the negative sign perseveres for a negative x. Regretfully, functions can also be neither even nor odd without domain restrictions.

One special function is important enough to mention directly here. A linear function is one that preserves the algebraic nature of the real numbers such that f() is a linear function if:

$f(x_1 + x_2) = f(x_1) + f(x_2)$ and $f(kx_1) = kf(x_1)$

for two points, $x_1$ and $x_2$, in the domain of f() and an arbitrary constant number k. This is often more general in practice with multiple functions and multiple constants, forms such as:

$F(x_1, x_2, x_3) = kf(x_1) + \ell g(x_2) + mh(x_3)$

for functions f(), g(), h() and constants k, $\ell$, m.

Example 1.3: The “Cube Rule” in Votes to Seats. A standard, though somewhat maligned, theory from the study of elections is due to Parker’s (1909) empirical research in Britain, which was later popularized in that country by Kendall and Stuart (1950, 1952). He looked at systems with two major parties whereby the largest vote-getter in a district wins regardless of the size of the winning margin (the so-called first past the post system used by most English-speaking countries). Suppose that A denotes the proportion of votes for one party and B the proportion of votes for the other. Then, according to this rule, the ratio of seats in Parliament won is approximately the cube of the ratio of votes:

A/B in votes implies $A^3/B^3$ in seats


(sometimes ratios are given in the notation A : B). The political principle from this theory is that small differences in the vote ratio yield large differences in the seats ratio and thus provide stable parliamentary government. So how can we express this theory in standard mathematical function notation? Define x as the ratio of votes for the party with proportion A over the party with proportion B. Then expressing the cube law in this notation yields $f(x) = x^3$ for the function determining seats, which of course is very simple. Tufte (1973) reformulated this slightly by noting that in a two-party contest the proportion of votes for the second party can be rewritten as B = 1 − A. Furthermore, if we define the proportion of seats for the first party as $S_A$, then similarly the proportion of seats for the second party is $1 - S_A$, and we can reexpress the cube rule in this notation as

\[ \frac{S_A}{1 - S_A} = \left(\frac{A}{1 - A}\right)^3. \]

This equation has an interesting shape with a rapid change in the middle of the range of A, clearly showing the nonlinearity in the relationship implied by the cube function. This shape means that the winning party’s gains are more pronounced in this area and less dramatic toward the tails. This is shown in Figure 1.4.
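To see the shape described here in numbers, the cube rule can be solved for the seat proportion: multiplying out the ratio form gives S_A = A³ / (A³ + (1 − A)³). The sketch below evaluates this for a few vote proportions; the function name `seat_share` and the chosen vote shares are illustrative assumptions.

```python
def seat_share(a, power=3):
    """Seat proportion S_A implied by vote proportion a.

    Solves S_A / (1 - S_A) = (a / (1 - a))**power for S_A.
    power=3 is the cube rule; other values are allowed for comparison.
    """
    return a**power / (a**power + (1 - a)**power)

for a in (0.40, 0.45, 0.50, 0.55, 0.60):
    print(f"A = {a:.2f}  ->  S_A = {seat_share(a):.3f}")
```

An even split of votes maps to an even split of seats, while a vote share of 0.55 maps to a seat share of roughly 0.65, which is the amplification near the middle of the range that the text describes.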

Taagepera (1986) looked at this for a number of elections around the world and found some evidence that the rule ﬁts. For instance, U.S. House races for the period 1950 to 1970 with Democrats over Republicans give a value of exactly 2.93, which is not too far off the theoretical value of 3 supplied by Parker.

Fig. 1.4. The Cube Law [plot of $S_A$ against A, both running from 0 to 1, with the point (0.5, 0.5) marked]

1.5.1 Applying Functions: The Equation of a Line

Recall the familiar expression of a line in Cartesian coordinates usually given as y = mx + b, where m is the slope of the line (the change in y for a one-unit change in x) and b is the point where the line intercepts the y-axis. Clearly this is a (linear) function in the sense described above and also clearly we can determine any single value of y for a given value of x, thus producing a matched pair. A classic problem is to find the slope and equation of a line determined by two points. This is always unique because any two points in a Cartesian coordinate system can be connected by one and only one line. Actually we can generalize this in a three-dimensional system, where three points determine a unique plane, and so on. This is why a three-legged stool never wobbles and a four-legged chair sometimes does (think about it!).

Back to our problem. . . suppose that we want to find the equation of the line that goes through the two points {[2, 1], [3, 5]}. What do we know from this information? We know that for one unit of increasing x we get four units of increasing y. Since slope is “rise over run,” then:

\[ m = \frac{5 - 1}{3 - 2} = 4. \]

Great, now we need to get the intercept. To do this we need only to plug m into the standard line equation, set x and y to one of the known points on the line, and solve (we should pick the easier point to work with, by the way):

y = mx + b
1 = 4(2) + b
b = 1 − 8 = −7.

This is equivalent to starting at some selected point on the line and “walking down” until the point where x is equal to zero.
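The two steps above (slope from “rise over run,” then intercept by substituting a known point) can be collected into a small routine. This is a minimal sketch; the function name `line_through` is hypothetical, and vertical lines are rejected because their slope is undefined.

```python
def line_through(p1, p2):
    """Return (m, b) for the line y = m*x + b through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    m = (y2 - y1) / (x2 - x1)   # rise over run
    b = y1 - m * x1             # solve y1 = m*x1 + b for b
    return m, b

m, b = line_through((2, 1), (3, 5))
print(m, b)   # slope and intercept for the points used in the text
```

Either point can be used to recover the intercept; substituting (3, 5) instead gives the same b.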

Fig. 1.5. Parallel and Perpendicular Lines [two panels, each plotting y against x]

The Greeks and other ancients were fascinated by linear forms, and lines are an interesting mathematical subject unto themselves. For instance, two lines

$y = m_1 x + b_1$
$y = m_2 x + b_2$,

are parallel if and only if (often abbreviated as “iff”) $m_1 = m_2$ and perpendicular (also called orthogonal) iff $m_1 = -1/m_2$. For example, suppose we have the line $L_1: y = -2x + 3$ and are interested in finding the line parallel to $L_1$ that goes through the point [3, 3]. We know that the slope of this new line must be −2, so we now plug this value in along with the only values of x and y that we know are on the line. This allows us to solve for b and plot the parallel line in the left panel of Figure 1.5:

$(3) = -2(3) + b_2$, so $b_2 = 9$.

This means that the parallel line is given by $L_2: y = -2x + 9$. It is not much more difficult to get the equation of the perpendicular line. We can do the same trick but instead plug in the negative reciprocal of the slope from $L_1$:

$(3) = \frac{1}{2}(3) + b_3$, so $b_3 = \frac{3}{2}$.

This gives us $L_3 \perp L_1$, where $L_3: y = \frac{1}{2}x + \frac{3}{2}$.
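The parallel and perpendicular calculations share one step: once the slope is fixed, the intercept follows from the known point. A minimal sketch, with the hypothetical helper `intercept_through`, reproduces both results for the line y = −2x + 3 and the point [3, 3].

```python
def intercept_through(m, point):
    """Intercept b of the line with slope m passing through point (x, y)."""
    x, y = point
    return y - m * x            # rearranged from y = m*x + b

m1 = -2                         # slope of the original line y = -2x + 3
point = (3, 3)

# Parallel line: same slope.
b_parallel = intercept_through(m1, point)

# Perpendicular line: slope is the negative reciprocal.
m_perp = -1 / m1
b_perp = intercept_through(m_perp, point)

print(b_parallel)               # intercept of the parallel line
print(m_perp, b_perp)           # slope and intercept of the perpendicular line
print(m1 * m_perp)              # product of perpendicular slopes
```

The final line illustrates an equivalent way to state the perpendicularity condition: the two slopes multiply to −1.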

Example 1.4: Child Poverty and Reading Scores. Despite overall national wealth, a surprising number of U.S. school children live in poverty. A continuing concern is the effect that this has on educational development and attainment. This is important for normative as well as societal reasons. Consider the following data collected in 1998 by the California Department of Education (CDE) by testing all 2nd–11th grade students in various subjects (the Stanford 9 test). These data are aggregated to the school district level here for two variables: the percentage of students who qualify for reduced or free lunch plans (a common measure of poverty in educational policy studies)


and the percent of students scoring over the national median for reading at the 9th grade. The median (average) is the point where one-half of the points are greater and one-half of the points are less. Because of the effect of limited English proﬁciency students on district performance, this test was by far the most controversial in California amongst the exam topics. In addition, administrators are sensitive to the aggregated results of reading scores because it is a subject that is at the core of what many consider to be “traditional” children’s education. The relationship is graphed in Figure 1.6 along with a linear trend with a slope of m = −0.75 and an intercept at b = 81. A very common tool of social scientists is the so-called linear regression model. Essentially this is a method of looking at data and ﬁguring out an underlying trend in the form of a straight line. We will not worry about any of the calculation details here, but we can think about the implications. What does this particular line mean? It means that for a 1% positive change (say from 50 to 51) in a district’s poverty, they will have an expected reduction in the pass rate of three-quarters of a percent. Since this line purports to ﬁnd the underlying trend across these 303 districts, no district will exactly see these results, but we are still claiming that this captures some common underlying socioeconomic phenomena.
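The interpretation of the trend line can be made concrete by evaluating it. The sketch below uses only the slope and intercept quoted in the text (m = −0.75, b = 81); the function name is hypothetical, and the district-level data themselves are not reproduced here.

```python
def expected_pass_rate(poverty_pct, slope=-0.75, intercept=81.0):
    """Expected percent above the national median, from the linear trend."""
    return intercept + slope * poverty_pct

at_50 = expected_pass_rate(50)
at_51 = expected_pass_rate(51)

print(at_50, at_51)
print(at_51 - at_50)   # change for a one-point increase in poverty
```

The difference between any two adjacent poverty values is always the slope, which is exactly the “three-quarters of a percent” reduction described above.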

[Figure 1.6; axis label: Percent Receiving Subsidized Lunch]
It should be clear that this function grows rapidly for increasing values of x, and sometimes the result overwhelms commonly used hand calculators. Try, for instance, to calculate 100! with yours. In some common applications large factorials are given in the context of ratios and a handy cancellation can be used to make the calculation easier. It would be difficult or annoying to calculate 190!/185! by first obtaining the two factorials and then dividing. Fortunately we can use

\[ \frac{190!}{185!} = \frac{190 \cdot 189 \cdot 188 \cdot 187 \cdot 186 \cdot 185 \cdot 184 \cdot 183 \cdots}{185 \cdot 184 \cdot 183 \cdots} = 190 \cdot 189 \cdot 188 \cdot 187 \cdot 186 = 234{,}816{,}064{,}560 \]

(recall that “·” and “×” are equivalent notations for multiplication). It would not initially seem like this calculation produces a value of almost 250 billion, but it does! Because factorials increase so quickly in magnitude, they can sometimes be difficult to calculate directly. Fortunately there is a handy way to get around this problem called Stirling’s Approximation (curiously named since it is credited to De Moivre’s 1720 work on probability):

\[ n! \approx (2\pi n)^{\frac{1}{2}} e^{-n} n^{n}. \]

Here e ≈ 2.718, which is an important constant defined on page 36. Notice that, as its name implies, this is an approximation. We will return to factorials in Chapter 7 when we analyze various counting rules.
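Both the cancellation trick and Stirling’s Approximation are easy to check numerically. The sketch below relies on Python’s arbitrary-precision integers, so the exact factorials are available for comparison; the function name `stirling` and the chosen values of n are illustrative.

```python
import math

# Ratio of factorials: direct division versus the cancelled product.
ratio_direct = math.factorial(190) // math.factorial(185)
ratio_cancelled = 190 * 189 * 188 * 187 * 186
print(ratio_direct, ratio_cancelled)

def stirling(n):
    """Stirling's Approximation: sqrt(2*pi*n) * e**(-n) * n**n."""
    return math.sqrt(2 * math.pi * n) * math.exp(-n) * n**n

# Compare the approximation with the exact factorial for a few n.
for n in (5, 10, 20):
    exact = math.factorial(n)
    approx = stirling(n)
    print(n, exact, approx, approx / exact)
```

The ratio in the last column moves toward 1 as n grows, so the approximation is slightly low but improves in relative terms for larger n.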

Example 1.5: Coalition Cabinet Formation. Suppose we are trying to form a coalition cabinet with three parties. There are six senior members of the Liberal Party, five senior members of the Christian Democratic Party, and four senior members of the Green Party vying for positions in the cabinet. How many ways could you choose a cabinet composed of three Liberals, two Christian Democrats, and three Greens? It turns out that the number of possible subsets of y items from a set of n items is given by the “choose notation” formula:

\[ \binom{n}{y} = \frac{n!}{y!(n - y)!}, \]

which can be thought of as the permutations of n divided by the permutations of y times the permutations of “not y.” This is called unordered without replacement because it does not matter what order the members are drawn in, and once drawn they are not thrown back into the pool for possible reselection. There are actually other ways to select samples from populations, and these are given in detail in Chapter 7 (see, for instance, the discussion in Section 7.2). So now we have to multiply the number of ways to select three Liberals, the two CDPs, and the three Greens to get the total number of possible cabinets (we multiply because we want the full number of combinatoric possibilities across the three parties):

\[ \binom{6}{3}\binom{5}{2}\binom{4}{3} = \frac{6!}{3!(6-3)!} \cdot \frac{5!}{2!(5-2)!} \cdot \frac{4!}{3!(4-3)!} = \frac{720}{6(6)} \cdot \frac{120}{2(6)} \cdot \frac{24}{6(1)} = 20 \times 10 \times 4 = 800. \]

This number is relatively large because of the multiplication: For each single choice of members from one party we have to consider every possible choice from the others. In a practical scenario we might have many fewer politically viable combinations due to overlapping expertise, jealousies, rivalries, and other interesting phenomena.
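The cabinet count can be verified in two ways: from the factorial formula directly, and with the library routine `math.comb` (available in Python 3.8 and later). The helper name `choose` is hypothetical and is included only to mirror the formula in the text.

```python
import math

def choose(n, y):
    """Number of unordered subsets of size y from n items, via factorials."""
    return math.factorial(n) // (math.factorial(y) * math.factorial(n - y))

liberals = choose(6, 3)              # three of six Liberals
christian_democrats = choose(5, 2)   # two of five Christian Democrats
greens = choose(4, 3)                # three of four Greens

total_cabinets = liberals * christian_democrats * greens
print(liberals, christian_democrats, greens, total_cabinets)

# Cross-check against the standard library implementation.
print(total_cabinets == math.comb(6, 3) * math.comb(5, 2) * math.comb(4, 3))
```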

1.5.3 The Modulo Function

Another function that has special notation is the modulo function, which deals with the remainder from a division operation. First, let’s define a factor: y is a factor of x if the result of x/y is an integer (i.e., a prime number has exactly two factors: itself and one). So if we divided x by y and y was not a factor of x, then there would necessarily be a noninteger remainder between zero and one. This remainder can be an inconvenience where it is perhaps discarded, or it can be considered important enough to keep as part of the result. Suppose instead that this was the only part of the result from division that we cared about. What symbology could we use to remove the integer component and only keep the remainder? To divide x by y and keep only the remainder, we use the notation x (mod y).

The modulo function is also sometimes written as either $x \bmod y$ or $x \mod y$ (only the spacing differs).
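Most programming languages provide the modulo operation directly. The sketch below uses Python’s `%` operator and `divmod` on positive integers chosen purely for illustration; the helper `is_factor` is hypothetical. Note that conventions for negative operands differ across languages, so the example stays with positive values.

```python
def is_factor(y, x):
    """True if y divides x with no remainder, i.e. x (mod y) is zero."""
    return x % y == 0

# Quotient and remainder together: 17 = 3 * 5 + 2.
quotient, remainder = divmod(17, 5)
print(quotient, remainder)

print(17 % 5)            # remainder only, i.e. 17 (mod 5)
print(is_factor(4, 12))  # 4 is a factor of 12
print(is_factor(5, 12))  # 5 is not a factor of 12
```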

1.6 Polynomial Functions

Polynomial functions of x are functions that have components that raise x to some power:

$f(x) = x^2 + x + 1$
$g(x) = x^5 - 3^3 - x$
$h(x) = x^{100}$,

where these are polynomials in x of power 2, 5, and 100, respectively. We have already seen examples of polynomial functions in this chapter such as $f(x) = x^2$, $f(x) = x(1 - x)$, and $f(x) = x^3$. The convention is that a polynomial degree (power) is designated by its largest exponent with regard to the variable. Thus the polynomials above are of degree 2, 5, and 100, respectively.

Often we care about the roots of a polynomial function: where the curve of the function crosses the x-axis. This may occur at more than one place and may be difficult to find. Since y = f(x) is zero at the x-axis, root finding means discovering where the right-hand side of the polynomial function equals zero. Consider the function $h(x) = x^{100}$ from above. We do not have to work too hard to find that the only root of this function is at the point x = 0.

In many scientific fields it is common to see quadratic polynomials, which are just polynomials of degree 2. Sometimes these polynomials have easy-to-determine integer roots (solutions), as in

$x^2 - 1 = (x - 1)(x + 1) \implies x = \pm 1$,

and sometimes they do not, requiring the well-known quadratic equation

\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}, \]


where a is the multiplier on the $x^2$ term, b is the multiplier on the x term, and c is the constant. For example, solving for roots in the equation

$x^2 - 4x = 5$

is accomplished by

\[ x = \frac{-(-4) \pm \sqrt{(-4)^2 - 4(1)(-5)}}{2(1)} = -1 \text{ or } 5, \]

where a = 1, b = −4, and c = −5 from $f(x) = x^2 - 4x - 5 \equiv 0$.
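The quadratic formula is mechanical enough to write as a short routine. This is a minimal sketch that returns only real roots and assumes a is nonzero (otherwise the expression is not a quadratic); the function name `quadratic_roots` is hypothetical.

```python
import math

def quadratic_roots(a, b, c):
    """Real roots of a*x**2 + b*x + c = 0, in increasing order (a != 0)."""
    discriminant = b**2 - 4 * a * c
    if discriminant < 0:
        return ()                      # no real roots
    root = math.sqrt(discriminant)
    # A set removes the duplicate when the discriminant is zero.
    return tuple(sorted({(-b - root) / (2 * a), (-b + root) / (2 * a)}))

print(quadratic_roots(1, -4, -5))   # x**2 - 4x - 5 = 0, the example above
print(quadratic_roots(1, 0, -1))    # x**2 - 1 = 0, the factorable case
print(quadratic_roots(1, 0, 1))     # x**2 + 1 = 0 has no real roots
```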

1.7 Logarithms and Exponents

Exponents and logarithms (“logs” for short) confuse many people. However, they are such an important convenience that they have become critical to quantitative social science work. Furthermore, so many statistical tools use these “natural” expressions that understanding these forms is essential to some work. Basically exponents make convenient the idea of multiplying a number by itself (possibly) many times, and a logarithm is just the opposite operation. We already saw one use of exponents in the discussion of the cube rule relating votes to seats. In that example, we defined a function, $f(x) = x^3$, that used 3 as an exponent. This is only mildly more convenient than f(x) = x × x × x, but imagine if the exponent was quite large or if it was not an integer. Thus we need some core principles for handling more complex exponent forms. First let’s review the basic rules for exponents. The important ones are as follows.

The underlying principle that we see from these rules is that multiplication of the base (x here) leads to addition in the exponents (a and b here), but multiplication in the exponents comes from nested exponentiation, for example, $(x^a)^b = x^{ab}$ from above. One point in this list is purely notational: Power(x, a) comes from the computer expression of mathematical notation.

A logarithm of (positive) x, for some base b, is the value of the exponent that gets b to x:

$\log_b(x) = a \implies b^a = x$.

A frequently used base is b = 10, which defines the common log. So, for example,

$\log_{10}(100) = 2 \implies 10^{2} = 100$
$\log_{10}(0.1) = -1 \implies 10^{-1} = 0.1$
$\log_{10}(15) = 1.176091 \implies 10^{1.1760913} = 15$.

Another common base is b = 2:

$\log_{2}(8) = 3 \implies 2^{3} = 8$
$\log_{2}(1) = 0 \implies 2^{0} = 1$
$\log_{2}(15) = 3.906891 \implies 2^{3.906891} = 15$.
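These values can be confirmed with the logarithm functions in Python’s standard `math` module. The sketch below also raises each base back to the computed power to show that the exponent and the logarithm undo one another; results are rounded for display because floating-point arithmetic is not exact.

```python
import math

# Common (base 10) logs and the matching exponent check.
for x in (100, 0.1, 15):
    a = math.log10(x)
    print(x, round(a, 6), round(10**a, 6))

# Base 2 logs and the matching exponent check.
for x in (8, 1, 15):
    a = math.log2(x)
    print(x, round(a, 6), round(2**a, 6))
```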

Actually, it is straightforward to change from one logarithmic base to another. Suppose we want to change from base $b$ to a new base $a$. It turns out that we only need to divide the first expression by the log of the new base to the old base:
$$\log_a(x) = \frac{\log_b(x)}{\log_b(a)}.$$
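The change-of-base rule described above (divide by the log of the new base, taken in the old base) can be checked numerically. The following is a minimal sketch in Python, which is an assumption of this illustration rather than a tool used in the text; the helper name `change_base` is hypothetical.

```python
import math

def change_base(log_b_x: float, log_b_a: float) -> float:
    """Convert log_b(x) into log_a(x) by dividing by log_b(a)."""
    return log_b_x / log_b_a

# A logarithm is the exponent that takes the base to x.
log10_15 = math.log10(15)   # base-10 log of 15
log2_15 = math.log2(15)     # base-2 log of 15

# Change from base 10 to base 2: log_2(15) = log_10(15) / log_10(2).
converted = change_base(log10_15, math.log10(2))

print(round(log10_15, 6), round(log2_15, 6), round(converted, 6))
```

The converted value agrees with the directly computed base-2 logarithm up to floating-point rounding.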

A third common base is perhaps the most interesting. The natural log is the log with the irrational base: $e = 2.718281828459045235\ldots$. This does not seem like the most logical number to form a useful base, but in fact it turns out to be so. This is an enormously important constant in our numbering system and appears to have been lurking in the history of mathematics for quite some time, though without substantial recognition. Early work on logarithms in the seventeenth century by Napier, Oughtred, Saint-Vincent, and Huygens hinted at the importance of $e$, but it was not until Mercator published a table of “natural logarithms” in 1668 that $e$ had an association. Finally, in 1761 $e$ acquired its current name when Euler christened it as such.

Mercator appears not to have realized the theoretical importance of $e$, but soon thereafter Jacob Bernoulli helped in 1683. He was analyzing the (now-famous) formula for calculating compound interest, where the compounding is done continuously (rather than at set intervals):
$$f(p) = \left(1 + \frac{1}{p}\right)^{p}.$$
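The behavior of this compounding function for ever larger $p$ can be explored numerically. The sketch below is only an illustration in Python (an assumed tool, not part of the text): it evaluates $f(p)$ at a few increasing values of $p$ and prints Python's built-in constant `math.e` for comparison.

```python
import math

def f(p: float) -> float:
    """Bernoulli's compounding function (1 + 1/p)^p."""
    return (1.0 + 1.0 / p) ** p

# Evaluate the function as p grows; the values creep upward but stay below 3.
for p in (1, 10, 100, 10_000, 1_000_000):
    print(p, f(p))

print("math.e =", math.e)
```

The printed values rise from 2 and level off, which is the pattern behind the question of what happens as $p$ goes to infinity.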

Bernoulli’s question was, what happens to this function as $p$ goes to infinity? The answer is not immediately obvious because the fraction inside goes to zero, implying that the component within the parentheses goes to one and the exponentiation does not matter. But does the fraction go to zero faster than the exponentiation grows ever larger? Bernoulli made the surprising discovery that this function in the limit (i.e., as $p \to \infty$) must be between 2 and 3. Then what others missed Euler made concrete by showing that the limiting value of this function is actually $e$. In addition, he showed that the answer to Bernoulli’s question could also be found by
$$e = 1 + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \ldots$$
(sometimes given as $e = \frac{1}{1!} + \frac{2}{2!} + \frac{3}{3!} + \frac{4}{4!} + \ldots$). Clearly this (Euler’s expansion) is a series that adds declining values because the factorial in the denominator will grow much faster than the series of integers in the numerator.

Euler is also credited with being the first (that we know of) to show that $e$, like $\pi$, is an irrational number: There is no end to the series of nonrepeating numbers to the right of the decimal point. Irrational numbers have bothered mankind for much of their recognized existence and have even had negative connotations. One commonly told story holds that the Pythagoreans put one of their members to death after he publicized the existence of irrational numbers. The discovery of negative numbers must have also perturbed the Pythagoreans because they believed in the beauty and primacy of natural numbers (the fact that the diagonal of a square with sides equal to one unit has length $\sqrt{2}$ also caused them great consternation).

It turns out that nature has an affinity for $e$ since it appears with great regularity among organic and physical phenomena. This makes its use as a base for the log function quite logical and supportable. As an example from biology, the chambered nautilus (Nautilus pompilius) forms a shell that is characterized as “equiangular” because the angle from the source radiating outward is constant as the animal grows larger. Aristotle (and other ancients) noticed this as well as the fact that the three-dimensional space created by growing new chambers always has the same shape, growing only in magnitude. We can illustrate this with a cross section of the shell created by a growing spiral of consecutive right triangles (the real shell is curved on the outside) according to

$$x = r \times e^{k\theta}\cos(\theta) \qquad y = r \times e^{k\theta}\sin(\theta),$$

where $r$ is the radius at a chosen point, $k$ is a constant, $\theta$ is the angle at that point starting at the x-axis proceeding counterclockwise, and sin, cos are functions that operate on angles and are described in the next chapter (see page 56). Notice the centrality of $e$ here, almost implying that these mollusks sit on the ocean floor pondering the mathematical constant as they produce shell chambers. A two-dimensional cross section is illustrated in Figure 1.7 ($k = 0.2$, going around two rotations), where the characteristic shape is obvious even with the triangular simplification.

Fig. 1.7. Nautilus Chambers (spiral cross section plotted with x on the horizontal axis and y on the vertical axis)

Given the central importance of the natural exponent, it is not surprising that the associated logarithm has its own notation:
$$\log_e(x) = \ln(x) = a \implies e^{a} = x,$$
and by the definition of $e$
$$\ln(e^{x}) = x.$$
This inner function ($e^x$) has another common notational form, exp(x), which comes from expressing mathematical notation on a computer. There is another notational convention that causes some confusion. Quite frequently in the statistical literature authors will use the generic form log() to denote the natural logarithm based on $e$. Conversely, it is sometimes defaulted to $b = 10$ elsewhere (often in engineering and therefore less relevant to the social sciences). Part of the reason for this shorthand for the natural log is the pervasiveness of $e$ in the mathematical forms that statisticians care about, such as the form that defines the normal probability distribution.
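The notational point about the generic log() can be seen in software as well. As a small illustration (Python is an assumption of this sketch, not something the text relies on), Python's standard `math.log` with one argument is the natural logarithm, `math.exp` is the exp(x) form, and `math.log10` is the common log.

```python
import math

x = 15.0

ln_x = math.log(x)        # one-argument math.log is the natural log, ln(x)
common_x = math.log10(x)  # the base-10 (common) log needs its own function

# exp() and the natural log undo each other.
round_trip_1 = math.exp(ln_x)          # e^{ln(x)} should return x
round_trip_2 = math.log(math.exp(2.5)) # ln(e^{2.5}) should return 2.5

print(ln_x, common_x, round_trip_1, round_trip_2)
```

Because the two logarithms differ by a constant factor, it is worth confirming which base a given package or article means by log().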

The relationship between Fahrenheit and Centigrade can be expressed as 5f − 9c = 160. Show that this is a linear function by putting it in y = mx + b format with c = y. Graph the line indicating slope and intercept.

1.5

Another way to describe a line in Cartesian terms is the point-slope form: $(y - y') = m(x - x')$, where $y'$ and $x'$ are given values and $m$ is the slope of the line. Show that this is equivalent to the form given by solving for the intercept.

A very famous sequence of numbers is called the Fibonacci sequence, which starts with 0 and 1 and continues according to: 0, 1, 1, 2, 3, 5, 8, 13, 21, . . . Figure out the logic behind the sequence and write it as a function using subscripted values like $x_j$ for the $j$th value in the sequence.

1.8 In the example on page 24, the cube law was algebraically rearranged to solve for $S_A$. Show these steps.

1.9 Which of the following functions are continuous? If not, where are the discontinuities?
$$f(x) = \frac{9x^3 - x}{(x - 1)(x + 1)}$$
2

Find the equation of the line that goes through the two points {[−1, −2], [3/2, 5/2]}.

1.11

Use the diagram of the square (with sides divided into segments of length $a$ and $b$) to prove that $(a - b)^2 + 4ab = (a + b)^2$ (i.e., demonstrate this equality geometrically rather than algebraically with features of the square shown).

1.12

Suppose we are trying to put together a Congressional committee that has representation from four national regions. Potential members are drawn from a pool with 7 from the northeast, 6 from the south, 4 from the Midwest, and 6 from the far west. How many ways can you choose a committee that has 3 members from each region for a total of 12?

1.13 Sørensen’s (1977) model of social mobility looks at the process of increasing attainment in the labor market as a function of time, personal qualities, and opportunities. Typical professional career paths follow a logarithmic-like curve with rapid initial advancement and tapering off progress later. Label $y_t$ the attainment level at time period $t$ and $y_{t-1}$ the attainment in the previous period, both of which are defined over $\mathbb{R}^{+}$. Sørensen stipulates:
$$y_t = \frac{r}{s}\left[\exp(st) - 1\right] + y_{t-1}\exp(st),$$

where $r \in \mathbb{R}^{+}$ is the individual’s resources and abilities and $s \in \mathbb{R}^{+}$ is the structural impact (i.e., a measure of opportunities that become available). What is the domain of $s$, that is, what restrictions are necessary on what values it can take on in order for this model to make sense in that declining marginal manner?

1.14 The following data are U.S. Census Bureau estimates of population over a 5-year period.

Date            Total U.S. Population
July 1, 2004    293,655,404
July 1, 2003    290,788,976
July 1, 2002    287,941,220
July 1, 2001    285,102,075
July 1, 2000    282,192,162

Characterize the growth in terms of a parametric expression. Graphing may help.

1.15 Using the change of base formula for logarithms, change $\log_6(36)$ to $\log_3(36)$.

1.16 Glottochronology is the anthropological study of language change and evolution. One standard theory (Swadish 1950, 1952) holds that words endure in a language according to a “decay rate” that can be expressed as $y = c^{2t}$, where $y$ is the proportion of words that are retained in a language, $t$ is the time in 1000 years, and $c = 0.805$ is a constant. Reexpress the relation using “e” (i.e., 2.71. . . ), as is done in some settings, according to $y = e^{-t/\tau}$, where $\tau$ is a constant you must specify. Van der Merwe (1966) claims that the Romance-Germanic-Slavic language split fits a curve with $\tau = 3.521$. Graph this curve and the curve from $\tau$ derived above with an x-axis along 0 to 7. What does this show?

1.17 Sociologists Holland and Leinhardt (1970) developed measures for models of structure in interpersonal relations using ranked clusters. This approach requires extensive use of factorials to express personal choices. The authors defined the notation $x^{(k)} = x(x - 1)(x - 2) \cdots (x - k + 1)$. Show that $x^{(k)}$ is just $x!/(x - k)!$.

1.18 For the equation $y^3 = x^2 + 2$ there is only one solution where $x$ and $y$ are both positive integers. Find this solution. For the equation $y^3 = x^2 + 4$ there are only two solutions where $x$ and $y$ are both positive integers. Find them both.

1.19 Show that in general
$$\sum_{i=1}^{m} \prod_{j=1}^{n} x_i y_j \neq \prod_{j=1}^{n} \sum_{i=1}^{m} x_i y_j$$

and construct a special case where it is actually equal.

1.20 A perfect number is one that is the sum of its proper divisors. The first three are
6 = 1 + 2 + 3
28 = 1 + 2 + 4 + 7 + 14
496 = 1 + 2 + 4 + 8 + 16 + 31 + 62 + 124 + 248.
Show that 8128 and 33550336 are perfect numbers. The Pythagoreans also defined abundant numbers: The number is less than the sum of its divisors, and deficient numbers: The number is greater than the sum of its divisors. Any divisor of a deficient number or perfect number turns out to be a deficient number itself. Show that this is true with 496. There is a function that relates perfect numbers to primes that comes from Euclid’s Elements (around 300 BC). If $f(x) = 2^{x} - 1$ is a prime number, then $g(x) = 2^{x-1}(2^{x} - 1)$ is a perfect number. Find an $x$ for the first three perfect numbers above.

1.21

Suppose we had a linear regression line relating the size of state-level unemployment percent on the x-axis and homicides per 100,000 of the state population on the y-axis, with slope m = 2.41 and intercept b = 27. What would be the expected effect of increasing unemployment by 5%?

The manner by which seats are allocated in the House of Representatives to the 50 states is somewhat more complicated than most people appreciate. The current system (since 1941) is based on the “method of equal proportions” and works as follows:
• Allocate one representative to each state regardless of population.
• Divide each state’s population by a series of values given by the formula $\sqrt{i(i - 1)}$ starting at $i = 2$, which looks like this for state $j$ with population $p_j$:
$$\frac{p_j}{\sqrt{2 \times 1}}, \frac{p_j}{\sqrt{3 \times 2}}, \frac{p_j}{\sqrt{4 \times 3}}, \ldots, \frac{p_j}{\sqrt{n \times (n - 1)}},$$
where $n$ is a large number.
• These values are sorted in descending order for all states and House seats are allocated in this order until 435 are assigned.

(a) The following are estimated state “populations” for the original 13 states in 1780 (Bureau of the Census estimates; the first official U.S. census was performed later in 1790):

State             Population
Virginia          538,004
Massachusetts     268,627
Pennsylvania      327,305
North Carolina    270,133
New York          210,541
Maryland          245,474
Connecticut       206,701
South Carolina    180,000
New Jersey        139,627
New Hampshire     87,802
Georgia           56,071
Rhode Island      52,946
Delaware          45,385

Calculate under this plan the apportionment for the ﬁrst House of Representatives that met in 1789, which had 65 members.

(b) The first apportionment plan was authored by Alexander Hamilton and uses only the proportional value and rounds down to get full persons (it ignores the remainders from fractions), and any remaining seats are allocated by the size of the remainders to give (10, 8, 8, 5, 6, 6, 5, 5, 4, 3, 3, 1, 1) in the order above. Relatively speaking, does the Hamilton plan favor or hurt large states? Make a graph of the differences.

(c) Show by way of a graph the increasing proportion of House representation that a single state obtains as it grows from the smallest to the largest in relative population.
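The three steps of the method of equal proportions described in this exercise can be sketched as a short routine. This is only an illustrative sketch in Python (an assumed tool): the function name `equal_proportions` and the three-state populations are hypothetical and are not the 1780 data, so the exercise itself is left to the reader.

```python
import math

def equal_proportions(populations: dict, seats: int) -> dict:
    """Allocate seats: one per state, then by descending p / sqrt(i(i-1))."""
    allocation = {state: 1 for state in populations}  # step 1: one seat each
    remaining = seats - len(populations)

    # Step 2: priority values p_j / sqrt(i(i-1)) for i = 2, 3, ...
    # No state can hold more than `seats` seats, so i stops there.
    priorities = []
    for state, pop in populations.items():
        for i in range(2, seats + 1):
            priorities.append((pop / math.sqrt(i * (i - 1)), state))

    # Step 3: sort in descending order and hand out the remaining seats.
    priorities.sort(reverse=True)
    for _, state in priorities[:remaining]:
        allocation[state] += 1
    return allocation

# Hypothetical populations, used only to exercise the routine.
toy = {"A": 600, "B": 300, "C": 100}
result = equal_proportions(toy, seats=10)
print(result)
```

Building the full list of priority values is wasteful for large inputs, but it mirrors the wording of the steps directly; a priority queue would be a natural refinement.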

1.27

The Nachmias–Rosenbloom Measure of Variation (MV) indicates how many heterogeneous intergroup relationships are evident from the full set of those mathematically possible given the population. Specifically it is described in terms of the “frequency” (their original language) of observed subgroups in the full group of interest. Call $f_i$ the frequency or proportion of the $i$th subgroup and $n$ the number of these groups. The index is created by
$$MV = \frac{\text{“each frequency} \times \text{all others, summed”}}{\text{“number of combinations”} \times \text{“mean frequency squared”}} = \frac{\sum_{i=1}^{n} (f_i \neq f_j)\, f_i f_j}{\frac{n(n-1)}{2}\, \bar{f}^{2}}.$$

Nachmias and Rosenbloom (1973) use this measure to make claims about how integrated U.S. federal agencies are with regard to race. For a population of 24 individuals: (a) What mixture of two groups (say blacks and whites) gives the maximum possible MV? Calculate this value. (b) What mixture of two groups (say blacks and whites) gives the minimum possible MV but still has both groups represented? Calculate this value as well.

The following table lists the Greek characters encountered in standard mathematical language along with a very short description of the standard way that each is considered in the social sciences (omicron is not used).

Name      Lowercase   Capitalized   Typical Usage
alpha     α           –             general unknown value
beta      β           –             general unknown value
gamma     γ           Γ             small case a general unknown value, capitalized version denotes a special counting function
delta     δ           ∆             often used to denote a difference
epsilon   ε           –             usually denotes a very small number or error
zeta      ζ           –             general unknown value
eta       η           –             general unknown value
theta     θ           Θ             general unknown value, often used for radians
iota      ι           –             rarely used
kappa     κ           –             general unknown value
lambda    λ           Λ             general unknown value, used for eigenvalues
mu        µ           –             general unknown value, denotes a mean in statistics
nu        ν           –             general unknown value
xi        ξ           Ξ             general unknown value
pi        π           Π             small case can be: 3.14159. . . , general unknown value, a probability function; capitalized version should not be confused with product notation
rho       ρ           –             general unknown value, simple correlation, or autocorrelation in time-series statistics
sigma     σ           Σ             small case can be unknown value or a variance (when squared), capitalized version should not be confused with summation notation
tau       τ           –             general unknown value
upsilon   υ           Υ             general unknown value
phi       φ           Φ             general unknown value, sometimes denotes the two expressions of the normal distribution
chi       χ           –             general unknown value, sometimes denotes the chi-square distribution (when squared)
psi       ψ           Ψ             general unknown value
omega     ω           Ω             general unknown value

2 Analytic Geometry

2.1 Objectives (the Width of a Circle)

This chapter introduces the basic principles of analytic geometry and trigonometry specifically. These subjects come up in social science research in seemingly surprising ways. Even if one is not studying some spatial phenomenon, such functions and rules can still be relevant. We will also expand beyond Cartesian coordinates and look at polar coordinate systems. At the end of the day, understanding trigonometric functions comes down to understanding their basis in triangles.

2.2 Radian Measurement and Polar Coordinates

So far we have only used Cartesian coordinates when discussing coordinate systems. There is a second system that can be employed when it is convenient to think in terms of a movement around a circle. Radian measurement treats the angular distance around the center of a circle (also called the pole or origin for obvious reasons) in the counterclockwise direction as a proportion of 2π. Most people are comfortable with another measure of angles, degrees, which are measured from 0 to 360. However, this system is arbitrary (although ancient) whereas radian measurement is based on the formula for the circumference of a circle: $c = 2\pi r$, where $r$ is the radius. If we assume a unit radius ($r = 1$), then the linkage is obvious. That is, from a starting point, moving 2π around the circle (a complete revolution) returns us to the radial point where we began. So 2π is equal to 360° in this context (more specifically for the unit circle described below).

This means that we can immediately translate between radians and degrees for angles simply by multiplying a radian measure by 360/2π or a degree measure by 2π/360. Figure 2.1 shows examples for angles of size θ = π/2 and θ = 2π, where r = 1 by assumption. Note that the direction is always assumed to be counterclockwise for angles specified in the positive direction. Negative radian measurement also makes sense because sometimes it will be convenient to move in the negative, that is, clockwise direction. In addition, it is also important to remember that the system also “restarts” at 2π in this direction as well, meaning the function value becomes zero. This means that going in opposite directions has interesting implications. For instance, positive and negative angular distances have interesting equalities, such as $-\frac{3}{2}\pi = \frac{1}{2}\pi$.
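The conversion factors and the “restart” at 2π can be written out as a few small helpers. This is a minimal sketch in Python (an assumed tool); the function names are hypothetical, and the modulo operation is one convenient way to express the restart, not a method taken from the text.

```python
import math

def to_degrees(radians: float) -> float:
    """Multiply a radian measure by 360 / (2*pi)."""
    return radians * 360.0 / (2.0 * math.pi)

def to_radians(degrees: float) -> float:
    """Multiply a degree measure by (2*pi) / 360."""
    return degrees * 2.0 * math.pi / 360.0

def normalize(theta: float) -> float:
    """Map any angle in radians onto the interval [0, 2*pi)."""
    return theta % (2.0 * math.pi)

print(to_degrees(math.pi / 2))      # a quarter turn in degrees
print(to_radians(360.0))            # a full revolution in radians
print(normalize(-1.5 * math.pi))    # -(3/2)pi lands on the same point as (1/2)pi
print(0.5 * math.pi)
```

Python's standard library also provides `math.degrees` and `math.radians`, which perform the same two conversions.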

The assumption of a unit-radius circle is just a convenience here. It turns out that, simply by standardizing, we can get the same angle measurement for any size circle. Suppose that r is the distance moving away from the origin and θ is the radian measurement. Then for different values of r we have wider circles but the same angle, as shown in the top panel of Figure 2.2. Actually this clarification suggests a broader coordinate definition. Polar coordinates are a system in which points on a plane are described by the number pair (θ, r), where θ is the radian measure and r is a distance from the origin. The bottom panel of Figure 2.2 gives some example points. It is actually quite easy to change between polar and Cartesian coordinates. The following table gives the required transformations.

Polar to Cartesian        Cartesian to Polar
$x = r\cos(\theta)$       $\theta = \arctan(y/x)$
$y = r\sin(\theta)$       $r = \sqrt{x^2 + y^2}$
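The transformations in the table can be sketched as two small functions. This illustration uses Python (an assumed tool) with hypothetical function names. One deliberate substitution should be noted: the sketch uses `math.atan2(y, x)` rather than a literal arctan(y/x), because the plain ratio cannot tell opposite quadrants apart and fails when x = 0.

```python
import math

def polar_to_cartesian(theta: float, r: float) -> tuple:
    """x = r cos(theta), y = r sin(theta)."""
    return r * math.cos(theta), r * math.sin(theta)

def cartesian_to_polar(x: float, y: float) -> tuple:
    """Return (theta, r) with theta reported in [0, 2*pi)."""
    r = math.hypot(x, y)                         # sqrt(x^2 + y^2)
    theta = math.atan2(y, x) % (2.0 * math.pi)   # quadrant-aware arctangent
    return theta, r

# Round-trip the example points labeled in Figure 2.2, given as (theta, r).
for theta, r in [(0.6 * math.pi, 9.0), (math.pi / 4, 12.0), (1.75 * math.pi, 18.0)]:
    x, y = polar_to_cartesian(theta, r)
    print((theta, r), "->", (x, y), "->", cartesian_to_polar(x, y))
```

Each point returns to its original (θ, r) pair up to floating-point rounding, including the point in the fourth quadrant.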

Fig. 2.2. Polar Coordinates (top panel: angles θ1 and θ2 with radii r1 and r2; bottom panel: example points [0.6π, 9], [π/4, 12], and [(7/4)π, 18])

2.3 What Is Trigonometry?

This section provides a short overview of trigonometric (or circular) functions. These ideas are very basic, but sometimes people can get intimidated by the language. Others may have bad memories from high school mathematics classes, which involved for some lots of pointless memorization.

The topic of trigonometry started as the study of triangles. Consider the angle θ of a right triangle as shown in the figure at the right. The Greeks were interested in the ratios of the sizes of the sides of this triangle, and they noticed that these could be related to the angle of θ.

[Figure: Right Triangle, with hypotenuse r, vertical side y, horizontal side x, and angle θ]

The Greeks were also very interested in the properties of right triangles in particular and considered them to be important cases (they are!). Of course the Pythagorean Theorem (a theorem is just a provable mathematical assertion) applies here, but there are additional relations of interest. The basic relations involve ratios of sides related to the acute angle (i.e., θ less than 90°). There are six core definitions for the even functions cosine and secant, as well as the odd functions sine, cosecant, tangent, and cotangent, given by
$$\sin(\theta) = \frac{y}{r} \qquad \csc(\theta) = \frac{r}{y}$$
$$\cos(\theta) = \frac{x}{r} \qquad \sec(\theta) = \frac{r}{x}$$
$$\tan(\theta) = \frac{y}{x} \qquad \cot(\theta) = \frac{x}{y}.$$

The sine (sin), cosine (cos), and tangent (tan) functions just given are the key foundations for the rest of the trigonometric functions. Note also that these are explicitly functions of θ. That is, changes in the angle θ force changes in the size of the sides of the triangle. There are also reciprocal relations: $\sin(\theta) = \csc(\theta)^{-1}$, $\cos(\theta) = \sec(\theta)^{-1}$, and $\tan(\theta) = \cot(\theta)^{-1}$. Also from these inverse properties we get $\sin(\theta)\csc(\theta) = 1$, $\cos(\theta)\sec(\theta) = 1$, and $\tan(\theta)\cot(\theta) = 1$. This implies that, of the six basic trigonometric functions, sine, cosine, and tangent are the more fundamental.
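The six ratio definitions and their reciprocal relations can be checked with a concrete triangle. The following is a minimal sketch in Python (an assumed tool); the function name `trig_from_sides` and the 3-4-5 triangle are choices made for this illustration only.

```python
import math

def trig_from_sides(x: float, y: float) -> dict:
    """Six trigonometric ratios for a right triangle with legs x and y."""
    r = math.hypot(x, y)  # hypotenuse from the Pythagorean Theorem
    return {
        "sin": y / r, "csc": r / y,
        "cos": x / r, "sec": r / x,
        "tan": y / x, "cot": x / y,
    }

vals = trig_from_sides(3.0, 4.0)   # the 3-4-5 right triangle
theta = math.atan2(4.0, 3.0)       # the acute angle opposite the side y

print(vals)
print(math.sin(theta), math.cos(theta), math.tan(theta))  # same values from the angle
print(vals["sin"] * vals["csc"], vals["cos"] * vals["sec"], vals["tan"] * vals["cot"])
```

The ratios computed from the sides match the library functions evaluated at the angle, and each reciprocal pair multiplies to one.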

Fig. 2.3. A General Trigonometric Setup (angle θ swept from the positive x-axis, with distances x, y, and r)

The original definition of the basic trigonometric functions turns out to be too restrictive because defining with a triangle necessarily restricts us to acute angles (we could be slightly more general by using triangles with obtuse angles). To expand the scope to a full 360° we need to be broad and use a full Cartesian coordinate system. Figure 2.3 shows the more general setup where the triangle defined by sweeping the angle θ counterclockwise from the positive side of the x-axis along with an r value gives the x and y distances. Now the trigonometric values are defined exactly in the way that they were in the table above except that x and y can now take on negative values.


So we can summarize values of these functions for common angular values (multiples of 45°) in tabular form to show some repeating patterns. Note the interesting cycling pattern that is similar, yet different, for sine and cosine.

Here the notation “–” denotes that the value is undeﬁned and comes from a division by zero operation. The Pythagorean Theorem gives some very useful and well-known relations between the basic trigonometric functions:

Confusingly, these also have alternate terminology: arcsin(b) = sin−1 (b), arccos(b) = cos−1 (b), and arctan(b) = tan−1 (b), which can easily be confused with the inverse relationships. Furthermore, some calculators (and computer programming languages) use the equivalent terms asin, acos, and atan. So sometimes the trigonometric functions are difﬁcult in superﬁcial ways even when the basic ideas are simple. We will ﬁnd that these trigonometric functions are useful in areas of applied calculus and spatial modeling in particular.

2.3.1 Radian Measures for Trigonometric Functions

Using the trigonometric functions is considerably easier and more intuitive with radians rather than degrees. It may have seemed odd that these functions “reset” at 360°, but from what we have seen of the circle it makes sense to do this at 2π for sine and cosine or at π for tangent:
$$\sin(\theta + 2\pi) = \sin(\theta) \qquad \csc(\theta + 2\pi) = \csc(\theta)$$
$$\cos(\theta + 2\pi) = \cos(\theta) \qquad \sec(\theta + 2\pi) = \sec(\theta)$$
$$\tan(\theta + \pi) = \tan(\theta) \qquad \cot(\theta + \pi) = \cot(\theta).$$
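These “reset” identities are easy to spot-check numerically. The sketch below uses Python (an assumed tool); the helper `repeats_after` and the particular test angles are hypothetical choices for the illustration, and a numerical check at a few angles is of course not a proof.

```python
import math

def repeats_after(func, period: float, thetas, tol: float = 1e-9) -> bool:
    """True if func(theta + period) matches func(theta) at every test angle."""
    return all(
        math.isclose(func(t + period), func(t), abs_tol=tol) for t in thetas
    )

# The reciprocal functions are not in the math module, so define them here.
def csc(t): return 1.0 / math.sin(t)
def sec(t): return 1.0 / math.cos(t)
def cot(t): return 1.0 / math.tan(t)

# Test angles chosen away from multiples of pi/2, where some functions are undefined.
angles = [0.3, 1.0, 2.5, 4.0, 5.5]

print(repeats_after(math.sin, 2 * math.pi, angles))
print(repeats_after(sec, 2 * math.pi, angles))
print(repeats_after(math.tan, math.pi, angles))
print(repeats_after(cot, math.pi, angles))
print(repeats_after(math.sin, math.pi, angles))  # pi alone is not a period of sine
```

The last line shows why the shorter period applies only to tangent and cotangent: shifting sine by π flips its sign.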

area of research where trigonometric functions are important is in spatial modeling of the utility of increasing state-level conflict. One particular model used by Bueno De Mesquita and Lalman (1986), and more recently Stam (1996), specifies the expected utility for conflict escalation radiating outward from the origin on a two-dimensional polar metric whereby θ specifies an angular direction and r specifies the intensity or stakes of the conflict. The y-axis gives Actor A’s utility, where the positive side of the origin gives expected utility increasing outcomes (A continuing to fight) and the negative side of the origin gives expected utility decreasing outcomes (A withdrawing). Likewise, the x-axis gives Actor B’s expected utility, where the positive side of the origin gives expected utility increasing outcomes (B continuing to fight) and the negative side of the origin gives utility decreasing outcomes (B withdrawing). Thus the value of θ determines “who wins” and “who loses,” where it is possible to have both actors receive positive (or negative) expected utility. The model thus asserts that nations make conflict escalation decisions based on these expected utilities for themselves as well as those assessed for their adversary. This construction is illustrated in the top panel of Figure 2.5. Even though the circle depiction is a very neat mnemonic about trade-offs, it does not show very well the consequences of θ to an individual actor. Now, if we transform to Cartesian coordinates (using the formulas above), then we get the illustration in the bottom panel of Figure 2.5. This is helpful because now we can directly see the difference in expected utility between Actor A (solid line) and Actor B (dashed line) at some value of θ by taking the vertical distance between the curves at that x-axis point.

Fig. 2.5. Views of Conflict Escalation (top panel: polar depiction with positive and negative directions for Actor A and Actor B and circles at r = 1 and r = 1.5; bottom panel: expected utility for A and B plotted against θ from 0 to 2π for r = 1 and r = 1.5)

Since r is the parameter that controls the value of the conflict (i.e., what is at stake here may be tiny for Ecuador/Peru but huge for U.S./U.S.S.R.), then the circle in the upper panel of the figure gives the universe of possible outcomes for one scenario. The points where the circle intersects with the axes provide absolute outcomes: Somebody wins or loses completely and somebody has a zero outcome. Perhaps more intuitively, we see in the lower panel that going from r = 1 to r = 1.5 magnifies the scope of the positive or negative expected utility to A and B. Furthermore, we can now see that increasing r has different effects on the expected utility difference for differing values of θ, something that was not very apparent from the polar depiction because the circle obviously just increased in every direction.
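The effect of r on the utility difference can be made concrete with the polar-to-Cartesian formulas. The sketch below, in Python (an assumed tool), rests on an interpretation that should be read as an assumption: since the y-axis carries Actor A's utility and the x-axis carries Actor B's, A's expected utility is taken to be r sin(θ) and B's to be r cos(θ). The function names are hypothetical.

```python
import math

def expected_utilities(theta: float, r: float) -> tuple:
    """Return (utility_A, utility_B) for direction theta and stakes r."""
    utility_a = r * math.sin(theta)  # y-axis: Actor A (assumed mapping)
    utility_b = r * math.cos(theta)  # x-axis: Actor B (assumed mapping)
    return utility_a, utility_b

def utility_gap(theta: float, r: float) -> float:
    """Vertical distance between the A and B curves at theta."""
    a, b = expected_utilities(theta, r)
    return a - b

# Compare the gap at r = 1 and r = 1.5 for a few directions.
for theta in (math.pi / 4, math.pi / 2, 3 * math.pi / 4):
    print(theta, utility_gap(theta, 1.0), utility_gap(theta, 1.5))
```

Under this mapping the gap is zero at θ = π/4 for any r, while at other angles it scales directly with r, which is one way to see why raising the stakes matters more in some directions than in others.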

2.3.2 Conic Sections and Some Analytical Geometry

Before we move on to more elaborate forms of analytical geometry, let us start with a reminder of the definition of a circle in analytical geometry terms. Intuitively we know that for a center c and a radius r, a circle is the set of points that are exactly r distance away from c in a two-dimensional Cartesian coordinate system. More formally: A circle with radius $r > 0$ and center at point $[x_c, y_c]$ is defined by the quadratic, multivariable expression
$$r^2 = (x - x_c)^2 + (y - y_c)^2.$$
This is “multivariable” because x and y both determine the function value. The most common form is the unit circle, which has radius $r = 1$ and is centered at the origin ($x_c = 0$, $y_c = 0$). This simplifies the more general form down to merely $1 = x^2 + y^2$.

Note that $r$, $c$, $x_c$, and $y_c$ are all fixed, known constants. The key part of this is that, for the general quadratic form $x^2 + y^2 + ax + by + c = 0$, $r$ can be backed out of $c$ above as $r = \frac{1}{2}\sqrt{a^2 + b^2 - 4c}$. Recall that $r$ is the radius of the circle, so we can look at the size of the radius that results from choices of the constants. First note that if $a^2 + b^2 = 4c$, then the circle has no size and condenses down to a single point. On the other hand, if $a^2 + b^2 < 4c$, then the square root is a complex number (and not defined on the Cartesian system we have been working with), and therefore the shape is undefined. So the only condition that provides a circle is $a^2 + b^2 > 4c$, and thus we have a simple test. The big point here is that the equational form for a circle is always a quadratic in two variables, but this form is not sufficient to guarantee a circle with positive area.

Example 2.2: Testing for a Circular Form. Does this quadratic equation define a circle?
$$x^2 + y^2 - 12x + 12y + 18 = 0$$
Here $a = -12$, $b = 12$, and $c = 18$. So $a^2 + b^2 = 144 + 144 = 288$ is greater than $4c = 4 \cdot 18 = 72$, and this is therefore a circle.
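The three-way test can be sketched as a small function. This illustration is in Python (an assumed tool) and takes the coefficients of a quadratic written as x² + y² + ax + by + c = 0, which is the form implied by the example; the function name `classify_quadratic` is hypothetical.

```python
import math

def classify_quadratic(a: float, b: float, c: float) -> tuple:
    """Classify x^2 + y^2 + a*x + b*y + c = 0 and return (kind, radius)."""
    disc = a * a + b * b - 4.0 * c
    if disc > 0:
        return "circle", 0.5 * math.sqrt(disc)  # r = (1/2) sqrt(a^2 + b^2 - 4c)
    if disc == 0:
        return "point", 0.0                      # the circle collapses to a point
    return "undefined", None                     # square root of a negative number

print(classify_quadratic(-12, 12, 18))  # the equation from Example 2.2
print(classify_quadratic(2, 2, 2))      # a^2 + b^2 equals 4c
print(classify_quadratic(0, 0, 1))      # a^2 + b^2 is less than 4c
```

For the example equation the function reports a circle and also returns the implied radius, which the hand calculation in the example does not need in order to answer the question.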

2.3.2.1 The Parabola

The parabola is produced by slicing through a three-dimensional cone with a two-dimensional plane such that the plane goes through the flat bottom of the cone. Of course there is a more mathematical definition: A parabola in two dimensions is the set of points that are equidistant from a given (fixed) point and a given (fixed) line. The point is called the focus of the parabola, and the line is called the directrix of the parabola. Figure 2.6 shows the same parabola in two frames for a focus at [0, p] and a directrix at y = −p, for p = 1. The first panel shows the points labeled and a single equidistant principle illustrated for a point on the parabola. The second frame shows a series of points on the left-hand side of the axis with their corresponding equal line segments to p and d.

Fig. 2.6. Characteristics of a Parabola (two panels plotting y against x; labeled features: the focus at (0, p) and the directrix at y = −1)

Of course there is a formula that dictates the relationship between the y-axis points and the x-axis points for the parabola:
$$y = \frac{x^2}{4p}.$$
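The equidistance definition and this formula can be tied together numerically. The sketch below, in Python (an assumed tool) with hypothetical function names, takes points on y = x²/(4p) and compares their distance to the focus [0, p] with their distance to the directrix y = −p.

```python
import math

def parabola_y(x: float, p: float) -> float:
    """Height of the parabola y = x^2 / (4p) at x."""
    return x * x / (4.0 * p)

def distances(x: float, p: float) -> tuple:
    """Distances from the parabola point at x to the focus and to the directrix."""
    y = parabola_y(x, p)
    to_focus = math.hypot(x, y - p)   # focus at [0, p]
    to_directrix = abs(y + p)         # directrix is the horizontal line y = -p
    return to_focus, to_directrix

p = 1.0  # the value used in Figure 2.6
for x in (-4.0, -2.0, 0.0, 1.0, 3.0):
    print(x, distances(x, p))
```

The two distances agree at every point, which follows algebraically because x² + (y − p)² equals (y + p)² whenever x² = 4py.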


This is a parabola facing upward and continuing toward greater positive values of y for positive values of p. If we make p negative, then the focus will be below the x-axis and the parabola will face downward. In addition, if we swap x and y in the definitional formula, then the parabola will face left or right (depending on the sign of p; see the Exercises). A more general form of a parabola is for the parabola whose focus is not on one of the two axes:
$$(y - y') = \frac{(x - x')^2}{4p},$$
which will have a focus at $[x', p + y']$ and a directrix at $y = y' - p$.

Example 2.3: Presidential Support as a Parabola. It is actually quite common to find parabolic relationships in political science research. Oftentimes the functional form is mentioned in an offhand way to describe a nonlinear relationship, but sometimes exact forms are fit with models. As an example of the latter case, Stimson (1976) found that early post-war presidential support scores fit a parametric curve. Presidents begin their 4-year term with large measures of popularity, which declines and then recovers before the end of their term, all in the characteristic parabolic form. Stimson took Gallup poll ratings of full presidential terms from Truman’s first term to Nixon’s first term and “fit” (estimated) parabolic forms that best followed the data values. Fortunately, Gallup used the same question wording for over 30 years, allowing comparison across presidents. Using the general parabolic notation above, the estimated parameters are

                       y′       x′      1/4p
Truman Term 1          53.37    2.25    8.85
Truman Term 2          32.01    3.13    4.28
Eisenhower Term 1      77.13    1.86    2.47
Eisenhower Term 2      68.52    2.01    1.43
Johnson Term 2         47.90    3.58    2.85
Nixon Term 1           58.26    2.60    4.61

Since $y'$ is the lowest point in the upward-facing parabola, we can use these values to readily see low ebbs in a given presidency and compare them as well. Not surprisingly, Eisenhower did the best in this regard. The corresponding parabolas are illustrated in Figure 2.7 by president. It is interesting to note that, while the forms differ considerably, the hypothesized shape bears out (Stimson gave a justification of the underlying statistical work in his article). So, regardless of political party, wartime status, economic conditions, and so on, there was a persistent parabolic phenomenon for approval ratings of presidents across the four years.
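The fitted curves can be evaluated directly from the table by rearranging the general form to y = y′ + (1/4p)(x − x′)². The sketch below uses Python (an assumed tool) with hypothetical names. The table does not state the units of x; treating x as time elapsed in the term is an assumption of this sketch, so the printed values away from the low point are illustrative only.

```python
# Estimated parameters from the table: (y_low, x_low, one_over_4p).
PARAMS = {
    "Truman Term 1": (53.37, 2.25, 8.85),
    "Truman Term 2": (32.01, 3.13, 4.28),
    "Eisenhower Term 1": (77.13, 1.86, 2.47),
    "Eisenhower Term 2": (68.52, 2.01, 1.43),
    "Johnson Term 2": (47.90, 3.58, 2.85),
    "Nixon Term 1": (58.26, 2.60, 4.61),
}

def approval(x: float, y_low: float, x_low: float, one_over_4p: float) -> float:
    """Evaluate y = y' + (1/4p) * (x - x')^2 for one fitted parabola."""
    return y_low + one_over_4p * (x - x_low) ** 2

# The low ebb of each term is the vertex value y'; find the highest and lowest.
best = max(PARAMS, key=lambda term: PARAMS[term][0])
worst = min(PARAMS, key=lambda term: PARAMS[term][0])
print("highest low ebb:", best, "| lowest low ebb:", worst)

# Approval implied one unit of x on either side of the low point.
for term, (y_low, x_low, k) in PARAMS.items():
    print(term, approval(x_low - 1.0, y_low, x_low, k), approval(x_low + 1.0, y_low, x_low, k))
```

Because each curve is symmetric around x′, the values one unit before and after the low point coincide, and the size of 1/4p controls how steeply approval falls and recovers.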

Fig. 2.7. Parabolic Presidential Popularity (panels: Truman Term 1, Truman Term 2, Eisenhower Term 1, Eisenhower Term 2, Johnson Term 2, Nixon Term 1; vertical axis: Gallup Presidential Approval; horizontal axis: Time by Quarter-Years)

2.3.2.2 The Ellipse

Another fundamental conic section is produced by slicing a two-dimensional plane through a three-dimensional cone such that the plane does not cut through the flat base of the cone. More precisely, an ellipse in two dimensions is the set of points whose summed distance to two points is a constant: For two foci $f_1$ and $f_2$, each point $p_i$ on the ellipse has $|p_i - f_1| + |p_i - f_2| = k$ (note the absolute value notation here).

Fig. 2.8. Characteristics of an Ellipse (two panels plotting y against x from −20 to 20; labeled points: the vertices [−a, 0], [a, 0], [0, b], [0, −b] and the foci f1 = [−c, 0], f2 = [c, 0])

Things are much easier if the ellipse is set up to rest on the axes, but this is not a technical necessity. Suppose that we define the foci as resting on the x-axis at the points [−c, 0] and [c, 0]. Then with the assumption a > b, we get the standard form of the ellipse from
$$\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1, \quad \text{where } c = \sqrt{a^2 - b^2}.$$

This form is pictured in the two panels of Figure 2.8 for a = 16, b = 9. The ﬁrst panel shows the two foci as well as the four vertices of the ellipse where the ellipse reaches its extremum in the two dimensions at x = ±16 and y = ±9.
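The defining property of the ellipse can be checked numerically for the a = 16, b = 9 case. The sketch below places points on the ellipse with the parametric form (a cos s, b sin s), which is a convenience assumed here rather than something used in the text, and compares the summed distance to the two foci across those points; the function names are illustrative.

```python
import math


def foci(a, b):
    """Foci [-c, 0] and [c, 0] of x^2/a^2 + y^2/b^2 = 1, assuming a > b."""
    c = math.sqrt(a ** 2 - b ** 2)
    return (-c, 0.0), (c, 0.0)


def point_on_ellipse(a, b, s):
    """A point on the ellipse via the parametric form (an assumption of this sketch)."""
    return (a * math.cos(s), b * math.sin(s))


def distance(p, q):
    """Euclidean distance between two points in the plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def summed_distance(p, f1, f2):
    """|p - f1| + |p - f2|, the quantity that defines the ellipse."""
    return distance(p, f1) + distance(p, f2)


a, b = 16, 9
f1, f2 = foci(a, b)
# Twelve points spread evenly around the ellipse, including the four vertices.
sums = [summed_distance(point_on_ellipse(a, b, k * math.pi / 6), f1, f2)
        for k in range(12)]
print(sums)
```

Every entry of the printed list should agree up to floating-point rounding, which is the constant-sum property stated in the definition.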


The second panel shows for selected points in the upper left quadrant the line segments to the two foci. For any one of these points along the ellipse where the line segments meet, the designated summed distance of the two segments must be a constant. Can we determine this sum easily? At first this looks like a hard problem, but notice that each of the four vertices must also satisfy this condition. We can now simply use the Pythagorean Theorem and pick one of them:

$$\frac{k}{2} = \sqrt{c^2 + b^2} = \sqrt{\left(\sqrt{a^2 - b^2}\right)^2 + b^2} = a.$$

Because this hypotenuse is only one-half of the required distance, we know that k = 2a, where a is greater than b. This also illustrates an elegant feature of the ellipse: If one picks any of the paired line segments in the second panel of Figure 2.8 and flattens them out down on the x-axis below, they will exactly join the two x-axis vertices. This is called the major axis of the ellipse; the other one is called, not surprisingly, the minor axis of the ellipse.

Example 2.4: Elliptical Voting Preferences. One research area where ellipses are an important modeling construction is describing spatial models of preferences. Often this is done in a legislative setting as a way of quantifying the utility of alternatives for individual lawmakers. Suppose that Congress needs to vote on a budget for research and development spending that divides total possible spending between coal and petroleum. A hypothetical member of Congress is assumed to have an ideal spending level for each of the two projects that trades off spending in one dimension against spending in the other. As an example, the highest altitude ideal point, and therefore the mode of the preference structure, is located at [petroleum = 0.65, coal = 0.35] on a standardized metric (i.e., dollars removed). Figure 2.9 shows the example representative's utility preference in two ways: a three-dimensional wire-frame drawing that reveals the dimensionality now present over the petroleum/coal grid (the first panel), and a contour plot that illustrates the declining levels of utility preference, U1, U2, U3 (the second panel). So the further away from [0.65, 0.35], the less happy the individual is with any proposed bill: She has a lower returned utility of a spending level at [0.2, 0.8] than at [0.6, 0.4]. By this means, any point outside a given contour provides less utility than all the points inside this contour, no matter what direction from the ideal point.
Fig. 2.9. Multidimensional Issue Preference [Figure: preference panels over Petroleum (horizontal axis, 0.0 to 1.0) and Coal (vertical axis, 0.0 to 1.0), with the labels p1 and c1]

But wait a minute. Is it realistic to assume that utility declines in exactly the same way for the two dimensions? Suppose our example legislator was adamant about her preference on coal and somewhat flexible about petroleum. Then the resulting generalization of circular preferences in two dimensions would be an ellipse (third panel) that is more elongated in the petroleum direction. The fact that the ellipse is a more generalized circle can be seen by comparing the equation of a circle centered at the origin with radius 1 ($1 = x^2 + y^2$) to the standard form of the ellipse above. If a = b, then c = 0 and there is only one focus, which would collapse the ellipse to a circle.

2.3.2.3 The Hyperbola

The third fundamental conic section comes from a different visualization. Suppose that instead of one cone we had two that were joined at the tip such that the flat planes at opposite ends were parallel. We could then take this three-dimensional setup of two cones "looking at each other" and slice through it with a plane that cuts equally through each of the flat planes for the two cones. This will produce a hyperbola, which is the set of points such that the difference between the distances to two foci $f_1$ and $f_2$ is a constant. We saw with the ellipse that $|p_i - f_1| + |p_i - f_2| = 2a$, but now we assert that $|p_i - f_1| - |p_i - f_2| = 2a$ for each point on the hyperbola (taking the difference in absolute value so that both curves are covered). If the hyperbola is symmetric around the origin with open ends facing vertically (as in Figure 2.10), then $f_1 = c$, $f_2 = -c$, and the standard form of the hyperbola is given by

$$\frac{y^2}{a^2} - \frac{x^2}{b^2} = 1, \quad \text{where } c = \sqrt{a^2 + b^2}.$$

Notice that this is similar to the equation for the ellipse except that the signs are different.
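As a numerical check of this definition, the following Python sketch uses a = 9 and b = 8 (the values used for Figure 2.10), takes the foci to lie on the y-axis at [0, c] and [0, −c], and evaluates the absolute difference of the two focal distances at several points on the upper curve. The function names and the sample x values are illustrative assumptions.

```python
import math


def hyperbola_foci(a, b):
    """Foci [0, c] and [0, -c] of y^2/a^2 - x^2/b^2 = 1."""
    c = math.sqrt(a ** 2 + b ** 2)
    return (0.0, c), (0.0, -c)


def point_on_upper_curve(a, b, x):
    """Solve the standard form for the positive y at a given x."""
    return (x, a * math.sqrt(1.0 + x ** 2 / b ** 2))


def distance(p, q):
    """Euclidean distance between two points in the plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def distance_difference(p, f1, f2):
    """Absolute difference of the distances from p to the two foci."""
    return abs(distance(p, f1) - distance(p, f2))


a, b = 9, 8
f1, f2 = hyperbola_foci(a, b)
for x in (-15, -5, 0, 5, 15):
    p = point_on_upper_curve(a, b, x)
    # Each printed difference should match 2a up to rounding error.
    print(x, distance_difference(p, f1, f2), 2 * a)
```

By symmetry the same check applies to the lower curve, since reflecting a point across the x-axis simply swaps the roles of the two foci.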

Fig. 2.10. Characteristics of a Hyperbola [Figure: two panels with x and y axes running from −20 to 20; the first panel marks the foci f1 and f2 and the vertices v1 and v2]

As you can see from Figure 2.10, a hyperbola is actually two separate curves in 2-space, shown here for a = 9 and b = 8 in the standard form. As with the previous two figures, the first panel shows the individual points of interest, the foci and vertices, and the second panel shows a subset of the segments that define the hyperbola. Unfortunately these segments are not quite as visually intuitive as they are with the ellipse, because the hyperbola is defined by the difference of the two connected segments rather than the sum.

Example 2.5: Hyperbolic Discounting in Evolutionary Psychology and Behavioral Economics. There is a long literature that seeks to explain why people make decisions about immediate versus deferred rewards [see the review in Frederick et al. (2002)]. The classic definition of this phenomenon is that of Böhm-Bawerk (1959): "Present goods have in general greater subjective value than future (and intermediate) goods of equal quantity and quality." While it seems clear that everybody prefers $100 today rather than in one year, it is not completely clear what larger value is required to induce someone to wait a year for the payment. That is, what is an appropriate discounting schedule that reflects general human preferences and perhaps accounts for differing preferences across individuals or groups? This is a question of human and social cognitive and emotional effects on economic decisions.

One way to mathematically model these preferences is with a declining curve that reflects discounted value as time increases. A person might choose to wait one week to get $120 instead of $100 but not make such a choice if the wait was one year. The basic model is attributed to Samuelson (1937), who framed these choices in discounted utility terms by specifying (roughly) a positive constant followed by a positive discount rate declining over time. This can be linear or curvilinear, depending on the theory concerned. The first (and perhaps obvious) mathematical model for such discounted utility is the exponential function $\phi(t) = \exp(-\beta t)$, where t is time and β is a parameter that alters the rate of decay. This is illustrated in Figure 2.11.

Fig. 2.11. Exponential Curves [Figure: curves of $\phi(t) = \exp(-\beta t)$ for β = 1.5, β = 2, β = 4, and β = 10; horizontal axis from 0.0 to 3.0, vertical axis from 0.0 to 1.0]

While the exponential function is simple and intuitive, it apparently does not explain discounting empirically because with only one parameter to control the shape it is not flexible enough to model observed human behavior. Recall that the exponentiation here is just raising 2.718282 to the −βt power. Because there is nothing special about the constant, why not parameterize this as well? We can also add flexibility to the exponent by moving t down and adding a second parameter to control the power; the result is

$$\phi(t) = \pm(\gamma t)^{-\beta/\gamma}.$$

This is actually a hyperbolic functional form for t = x and φ(t) = y, in a more general form than we saw above. To illustrate this as simply as possible, set γ = 1, β = 1, and obtain the plot in Figure 2.12. So this is just like the hyperbola that we saw before, except that it is aligned with the axes such that the two curves are confined to the first and third quadrants.

Fig. 2.12. Derived Hyperbola Form [Figure: $\phi(t) = \pm(\gamma t)^{-\beta/\gamma}$ plotted with both axes running from −6 to 6]

But wait. The phenomenon that we are interested in (discounting) is confined to $\mathbb{R}^+$, so we only need the form in the upper right quadrant. Also, by additively placing a 1 inside the parentheses, giving $\phi(t) = (1 + \gamma t)^{-\beta/\gamma}$, we can standardize the expression such that discounting at time zero is 1 (i.e., people do not have a reduction in utility for immediate payoff) for every pair of parameter values. Loewenstein and Prelec (1992) used the constant to make sure that the proposed hyperbolic curves all intersect at the point (0,0.3). The constant is arbitrary, and the change makes it very easy to compare different hyperbolic functions, as in Figure 2.13. Interestingly, as γ gets very small and approaches zero, the hyperbolic function above becomes the previous exponential form.
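The two discounting forms are easy to compare numerically. The Python sketch below implements the exponential form exp(−βt) and the standardized hyperbolic form (1 + γt)^(−β/γ), evaluates the hyperbolic form for the parameter pairs shown in Figure 2.13, and illustrates the limiting behavior by using a very small γ. The function names and the particular small value of γ are assumptions chosen for illustration.

```python
import math


def exponential_discount(t, beta):
    """Exponential discounting: phi(t) = exp(-beta * t)."""
    return math.exp(-beta * t)


def hyperbolic_discount(t, beta, gamma):
    """Standardized hyperbolic discounting: phi(t) = (1 + gamma * t)**(-beta / gamma)."""
    return (1.0 + gamma * t) ** (-beta / gamma)


# (gamma, beta) pairs taken from the legend of Figure 2.13.
PAIRS = [(8, 2), (0.1, 0.33), (1, 1), (6, 8)]

for gamma, beta in PAIRS:
    values = [round(hyperbolic_discount(t, beta, gamma), 4) for t in (0, 2, 4, 6, 8)]
    print("gamma =", gamma, "beta =", beta, values)

# With gamma close to zero the hyperbolic form is numerically close to the
# exponential form with the same beta (small_gamma is an illustrative choice).
small_gamma = 1e-6
for t in (0.5, 1.0, 2.0):
    print(t, hyperbolic_discount(t, 1.0, small_gamma), exponential_discount(t, 1.0))
```

Each curve starts at 1 when t = 0 because of the added constant, which is what makes curves with different parameter values directly comparable.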

Fig. 2.13. General Hyperbola Form for Discounting [Figure: curves of $\phi(t) = (1 + \gamma t)^{-\beta/\gamma}$ for (γ = 8, β = 2), (γ = 0.1, β = 0.33), (γ = 1, β = 1), and (γ = 6, β = 8); horizontal axis from 0 to 8, vertical axis from 0.0 to 1.0]

The existing literature on utility discounting over time is quite large and stretches far back in time. There seems to be a majority that have settled on the hyperbolic form as the best alternative given empirical data [see Loewenstein and Prelec (1992), Harvey (1986), and Weitzman (2001), just to name a few recent examples].

Data analysts sometimes find it useful to transform variables into more convenient forms. For instance, data that are bounded by [0 : 1] can sometimes be inconvenient to model due to the bounds. Some transformations that address this (and can be easily undone to revert back to the original form) are

inverse logit: $f(x) = \log(x/(1 - x))$
cloglog: $f(x) = \log(-\log(1 - x))$
arc sine: $f(x) = \arcsin(\sqrt{x})$.

For x ∈ [0:1], what advantage does the arcsine transformation have? Graph each of these functions over this domain in the same graph. If you were modeling the underlying preference or utility for a dichotomous choice (0 and 1, such as a purchase or a vote), which form would you prefer? Write the inverse function to the arcsine transformation. Note: The arcsine transformation has been useful in analysis of variance (ANOVA) statistical tests when distributional assumptions are violated.

2.10 Lalman (1988) extended the idea using polar coordinates to model the conflict between two nations discussed on page 60 by defining circumstances whereby the actors escalate disputes according to the ordered pair of expected utilities: $s = \{E_i(U_{ij}), E_j(U_{ij})\}$, which define the expected utility of i against j and the expected utility of j against i. These are plotted in polar coordinates [as in Bueno De Mesquita and Lalman (1986)], where the first utility is given along the x-axis and the second along the y-axis. As an example, Lalman gave the following probabilities of escalation by the two actors:

$$p(Esc_i) = 0.4(r\cos\theta) + 0.5$$
$$p(Esc_j) = 0.4(r\sin\theta) + 0.5,$$

Using your plot, answer the following questions:
(a) How do you interpret his differing substantive regions along the x-axis?
(b) Is there a region where the probability of violence is high but the probability of war is low?
(c) Why is the probability of intervention high in the two regions where the probability of war is low?
(d) What happens when r < 1? How do you interpret this substantively in terms of international politics?

2.11 Suppose you had an ellipse enclosed within a circle according to:

ellipse: $\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1, \quad a > b$
circle: $x^2 + y^2 = a^2$.

What is the area of the ellipse?

2.12 Fourier analysis redefines a function in terms of sine and cosine functions by assuming a periodicity in the data. This is a useful tool in time series analysis where recurring effects are of interest. The basic Fourier function is given by

$$f(x) = \frac{1}{2}a_0 + \sum_{n=1}^{\infty}\left(a_n\cos(nx) + b_n\sin(nx)\right),$$

Yeats, Irons, and Rhoades (1975) found that annual deposit growth for 48 commercial banks can be modeled by the function:

$$\frac{D_{t+1}}{D_t} = 1.172 - 0.125t^{-1} + 1.135t^{-2},$$

where D is year-end deposits and t is years. Graph this equation for 20 years and identify the form of the curve.

2.14 Southwood (1978) was concerned with the interpretation (and misinterpretation) of interaction effects in sociological models. These are models where explanatory factors are assumed to contribute to some outcome in a manner beyond additive forms. For example, a family's socioeconomic status is often assumed to interact with the child's intelligence to affect occupational or educational attainment in a way that is beyond the simple additive contribution of either. In explaining interaction between two such contributing variables (X1 and X2 here), Southwood looked at the relations in Figure 2.15 below. It turns out that the value of θ is critical to making statements about the interactivity of X1 and X2. Fill in the missing values from the following statements, where two capital letters indicate the length of the associated segment: (a) OR = QR (b) OR = θ. cot θ. .

(c) QP = OT − OR = X1 cot (d) SP = X1 θ − X2 θ.

Fig. 2.15. Derivation of the Distance [Figure: diagram with axes X1 and X2, the angle θ, and the labeled points O, U, R, T, Q, P, W, and S]

Give the equation for the line along OS, and give the equation for the line along UW in terms of values of X1, θ, and SP (i.e., the length of the segment from S to P).

2.15 In studying the labor supply of nurses, Manchester (1976) defined ν as the wage and H(ν) as the units of work per time period (omitting a constant term, which is unimportant to the argument). He gave two possible explanatory functions: $H(\nu) = a\nu^2 + b\nu$ with constants a < 0, b >, and $H(\nu) = a + b/\nu$ with constants a unconstrained, b < 0. Which of these is a hyperbolic function? What is the form of the other?

3
Linear Algebra: Vectors, Matrices, and Operations

3.1 Objectives

This chapter covers the mechanics of vector and matrix manipulation and the next chapter approaches the topic from a more theoretical and conceptual perspective. The objective for readers of this chapter is not only to learn the mechanics of performing algebraic operations on these mathematical forms but also to start seeing them as organized collections of numerical values where the manner in which they are put together provides additional mathematical information. The best way to do this is to perform the operations oneself. Linear algebra is fun. Really! In general, the mechanical operations are nothing more than simple algebraic steps that anybody can perform: addition, subtraction, multiplication, and division. The only real abstraction required is "seeing" the rectangular nature of the objects in the sense of visualizing operations at a high level rather than getting buried in the algorithmic details.

When one reads high visibility journals in the social sciences, matrix algebra (a near synonym) is ubiquitous. Why is that? Simply because it lets us express extensive models in quite readable notation. Consider the following linear statistical model specification [from real work, Powers and Cox (1997)]. They are relating political blame to various demographic and regional political variables: for i = 1 to n, (BLAMEFIRST)$Y_i$ =

This expression is way too complicated to be useful! It would be easy for a reader interested in the political argument to get lost in the notation. In matrix algebra form this is simply $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$. In fact, even for very large datasets and very large model specifications (many data variables of interest), this form is exactly the same; we simply indicate the size of these objects. This is not just a convenience (although it really is convenient). Because we can notate large groups of numbers in an easy-to-read structural form, we can concentrate more on the theoretically interesting properties of the analysis. While this chapter provides many of the foundations for working with matrices in social sciences, there is one rather technical omission that some readers may want to worry about later. All linear algebra is based on properties that define a field. Essentially this means that logical inconsistencies that could have otherwise resulted from routine calculations have been precluded. Interested readers are referred to Billingsley (1995), Chung (2000), or Grimmett and Stirzaker (1992).

3.2 Working with Vectors

Vector. A vector is just a serial listing of numbers where the order matters. So we can store the first four positive integers in a single vector, which can be

a row vector: $\mathbf{v} = [1, 2, 3, 4]$, or a column vector: $\mathbf{v} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}$,

where $\mathbf{v}$ is the name for this new object. Order matters in the sense that the two vectors above are different, for instance, from

$$\mathbf{v}^* = [4, 3, 2, 1], \qquad \mathbf{v}^* = \begin{bmatrix} 4 \\ 2 \\ 3 \\ 1 \end{bmatrix}.$$

It is a convention that vectors are designated in bold type and individual values, scalars, are designated in regular type. Thus $\mathbf{v}$ is a vector with elements $v_1, v_2, v_3, v_4$, and $v$ would be some other scalar quantity. This gets a little confusing where vectors are themselves indexed: $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \mathbf{v}_4$ would indicate four vectors, not four scalars. Usually, however, authors are quite clear about which form they mean. Substantively it does not matter whether we consider a vector to be of column or row form, but it does matter when performing some operations. Also, some disciplines (notably economics) tend to default to the column form. In the row form, it is equally common to see spacing used instead of commas as delimiters: [1 2 3 4]. Also, the contents of these vectors can be integers, rational or irrational numbers, and even complex numbers; there are no restrictions.

So what kinds of operations can we do with vectors? The basic operations are very straightforward: addition and subtraction of vectors as well as multiplication and division by a scalar. The following examples use the vectors $\mathbf{u} = [3, 3, 3, 3]$ and $\mathbf{v} = [1, 2, 3, 4]$.

Example 3.1: Vector Addition Calculation.

$$\mathbf{u} + \mathbf{v} = [u_1 + v_1, u_2 + v_2, u_3 + v_3, u_4 + v_4] = [4, 5, 6, 7].$$

Example 3.2: Vector Subtraction Calculation.

$$\mathbf{u} - \mathbf{v} = [u_1 - v_1, u_2 - v_2, u_3 - v_3, u_4 - v_4] = [2, 1, 0, -1].$$

Example 3.3: Scalar Multiplication Calculation.

$$3 \times \mathbf{v} = [3 \times v_1, 3 \times v_2, 3 \times v_3, 3 \times v_4] = [3, 6, 9, 12].$$

Example 3.4: Scalar Division Calculation.

$$\mathbf{v} \div 3 = [v_1/3, v_2/3, v_3/3, v_4/3] = \left[\frac{1}{3}, \frac{2}{3}, 1, \frac{4}{3}\right].$$

So operations with scalars are performed on every vector element in the same way. Conversely, the key issue with addition or subtraction between two vectors is that the operation is applied only to the corresponding vector elements as pairs: the first vector elements together, the second vector elements together, and so on. There is one concern, however. With this scheme, the vectors have to be exactly the same size (same number of elements). This is called conformable in the sense that the first vector must be of a size that conforms with the second vector; otherwise they are (predictably) called nonconformable. In the examples above both u and v are 1 × 4 (row) vectors (alternatively called length k = 4 vectors), meaning that they have one row and four columns. Sometimes size is denoted beneath the vectors:

$$\underset{1\times 4}{\mathbf{u}} + \underset{1\times 4}{\mathbf{v}}.$$

It should then be obvious that there is no logical way of adding a 1 × 4 vector to a 1 × 5 vector. Note also that this is not a practical consideration with scalar multiplication or division as seen above, because we apply the scalar identically to each element of the vector when multiplying: $s(u_1, u_2, \ldots, u_k) = (su_1, su_2, \ldots, su_k)$.
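The four calculations in Examples 3.1 through 3.4, together with the conformability requirement, can be reproduced with a few lines of Python. This is only a sketch using plain lists; the function names are illustrative, and exact fractions are used for the division so that the thirds are not rounded.

```python
from fractions import Fraction


def check_conformable(u, v):
    """Raise an error unless the two vectors have the same number of elements."""
    if len(u) != len(v):
        raise ValueError(
            "nonconformable vectors: lengths %d and %d" % (len(u), len(v)))


def add(u, v):
    """Element-by-element addition of two conformable vectors."""
    check_conformable(u, v)
    return [ui + vi for ui, vi in zip(u, v)]


def subtract(u, v):
    """Element-by-element subtraction of two conformable vectors."""
    check_conformable(u, v)
    return [ui - vi for ui, vi in zip(u, v)]


def scalar_multiply(s, v):
    """Apply the scalar s identically to every element of v."""
    return [s * vi for vi in v]


def scalar_divide(v, s):
    """Divide every element of v by the scalar s, keeping exact fractions."""
    return [Fraction(vi) / Fraction(s) for vi in v]


u = [3, 3, 3, 3]
v = [1, 2, 3, 4]
print(add(u, v))
print(subtract(u, v))
print(scalar_multiply(3, v))
print([str(x) for x in scalar_divide(v, 3)])
```

Calling `add` on a length-4 and a length-5 vector raises the `ValueError`, which mirrors the statement that there is no logical way of adding a 1 × 4 vector to a 1 × 5 vector.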


There are a couple of "special" vectors that are frequently used. These are $\mathbf{1}$ and $\mathbf{0}$, which contain all 1's or 0's, respectively. As we shall soon see, there are a larger number of "special" matrices that have similarly important characteristics. It is easy to summarize the formal properties of the basic vector operations. Consider the vectors u, v, w, which are identically sized, and the scalars s and t. The following intuitive properties hold.

Multiplication of vectors is not quite so straightforward, and there are actually different forms of multiplication to make matters even more confusing. We will start with the two most important and save some of the other forms for the last section of this chapter.

Vector Inner Product. The inner product, also called the dot product, of two vectors results in a scalar (and so it is also called the scalar product). The inner product of two conformable vectors of arbitrary length k is the sum of the item-by-item products:

$$\mathbf{u}\cdot\mathbf{v} = [u_1v_1 + u_2v_2 + \cdots + u_kv_k] = \sum_{i=1}^{k} u_i v_i.$$

It might be somewhat surprising to see the return of the summation notation here (Σ, as described on page 11), but it makes a lot of sense since running through the two vectors is just a mechanical additive process. For this reason, it is relatively common, though possibly confusing, to see vector (and later matrix) operations expressed in summation notation.

Example 3.6: Simple Inner Product Calculation. A numerical example of an inner product multiplication is given by

$$\mathbf{u}\cdot\mathbf{v} = [3, 3, 3]\cdot[1, 2, 3] = [3\cdot 1 + 3\cdot 2 + 3\cdot 3] = 18.$$

When the inner product of two vectors is zero, we say that the vectors are orthogonal, which means they are at a right angle to each other (we will be more visual about this in Chapter 4). The notation for the orthogonality of two vectors is $\mathbf{u} \perp \mathbf{v}$ iff $\mathbf{u}\cdot\mathbf{v} = 0$. As an example of orthogonality, consider $\mathbf{u} = [1, 2, -3]$ and $\mathbf{v} = [1, 1, 1]$. As with the more basic addition and subtraction or scalar operations, there are formal properties for inner products:
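A short Python sketch can reproduce the inner product calculation and the orthogonality example, and spot-check two standard properties (commutativity and distributivity over vector addition). These two properties are well-known facts about inner products chosen here for illustration, not a reproduction of the property list in this section, and the function names are assumptions.

```python
def dot(u, v):
    """Inner (dot) product of two conformable vectors."""
    if len(u) != len(v):
        raise ValueError("nonconformable vectors")
    return sum(ui * vi for ui, vi in zip(u, v))


def is_orthogonal(u, v):
    """True when the inner product of u and v is zero."""
    return dot(u, v) == 0


def add(u, v):
    """Element-by-element vector addition, used for the distributive check."""
    return [ui + vi for ui, vi in zip(u, v)]


# Example 3.6 and the orthogonality example from the text.
print(dot([3, 3, 3], [1, 2, 3]))
print(is_orthogonal([1, 2, -3], [1, 1, 1]))

# Spot checks of two standard properties on illustrative vectors.
u, v, w = [3, 3, 3], [1, 2, 3], [1, 2, -3]
print(dot(u, v) == dot(v, u))                       # commutative
print(dot(add(u, v), w) == dot(u, w) + dot(v, w))   # distributive
```

Integer inputs are used so that the equality comparisons are exact; with floating-point data a small tolerance would be the safer comparison.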

Vector Cross Product. The cross product of two vectors (sometimes called the outer product, although this term is better reserved for a slightly different operation; see the distinction below) is slightly more involved than the inner product, in both calculation and interpretation. This is mostly because the result is a vector instead of a scalar. Mechanically, the cross product of two conformable vectors of length k = 3 is

$$\mathbf{u} \times \mathbf{v} = [u_2v_3 - u_3v_2,\; u_3v_1 - u_1v_3,\; u_1v_2 - u_2v_1],$$

meaning that the first element is a difference equation that leaves out the first elements of the original two vectors, and the second and third elements proceed accordingly. In the more general sense, we perform a series of "leave one out" operations that is more extensive than above because the suboperations are themselves cross products.
Fig. 3.1. Vector Cross Product Illustration [Figure: the elements $u_1, u_2, u_3$ stacked over $v_1, v_2, v_3$, with crossing arrows producing $w_1, w_2, w_3$; for the first element, $w_1 = u_2v_3 - u_3v_2$]

Figure 3.1 gives the intuition behind this product. First the u and v vectors are stacked on top of each other in the upper part of the illustration. The process of calculating the first vector value of the cross product, which we will call $w_1$, is done by "crossing" the elements in the solid box: $u_2v_3$ indicated by the first arrow and $u_3v_2$ indicated by the second arrow. Thus we see the result for $w_1$ as a difference between these two individual components. This is actually the determinant of the 2 × 2 submatrix, which is an important principle considered in some detail in Chapter 4.

Fig. 3.2. The Right-Hand Rule Illustrated [Figure: the vectors u and v in a plane with the orthogonal vector w]

Interestingly, the resulting vector from a cross product is orthogonal to both of the original vectors in the direction of the so-called "right-hand rule." This handy rule says that if you hold your hand as you would when hitchhiking, the curled fingers make up the original vectors and the thumb indicates the direction of the orthogonal vector that results from a cross product. In Figure 3.2 you can imagine your right hand resting on the plane with the fingers curling to the left and the thumb facing upward. For vectors u, v, w, the cross product properties are given by
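The mechanics of the length-3 cross product, and the orthogonality of the result to both original vectors, can be sketched in Python as follows. The vectors u = [3, 3, 3] and v = [1, 2, 3] are borrowed from the inner product example; the anticommutativity check is a standard fact included for illustration rather than a reproduction of the property list, and the function names are assumptions.

```python
def cross(u, v):
    """Cross product of two length-3 vectors."""
    if len(u) != 3 or len(v) != 3:
        raise ValueError("this sketch handles only length-3 vectors")
    return [
        u[1] * v[2] - u[2] * v[1],  # leaves out the first elements
        u[2] * v[0] - u[0] * v[2],  # leaves out the second elements
        u[0] * v[1] - u[1] * v[0],  # leaves out the third elements
    ]


def dot(u, v):
    """Inner product, used to confirm orthogonality."""
    return sum(ui * vi for ui, vi in zip(u, v))


u = [3, 3, 3]
v = [1, 2, 3]
w = cross(u, v)
print(w)
print(dot(u, w), dot(v, w))   # both zero when w is orthogonal to u and v
print(cross(v, u))            # same elements as w with the signs reversed
```

Reversing the order of the two vectors flips the sign of every element, which corresponds to the thumb pointing in the opposite direction under the right-hand rule.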

Sometimes the distinction between row vectors and column vectors is important. While it is often glossed over, vector multiplication should be done in a conformable manner with regard to multiplication (as opposed to addition discussed above) where a row vector multiplies a column vector such that their adjacent “sizes” match: a (1 × k) vector multiplying a (k × 1) vector for k

This adjacency above comes from the k that denotes the columns of v and the k that denotes the rows of u and the manner in which they are next to each other. Thus an inner product multiplication operation is implied here, even if it is not directly stated. An outer product would be implied by this type of adjacency:

$$\underset{k\times 1}{\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_k \end{bmatrix}} \times \underset{1\times k}{[v_1, v_2, \ldots, v_k]},$$

where the 1's are next to each other. So the cross product of two vectors is a vector, and the outer product of two conformable vectors is a matrix: a rectangular grouping of numbers that generalizes the vectors we have been working with up until now. This distinction helps us to keep track of the objective. Mechanically, this is usually easy. To be completely explicit about these operations we can also use the vector transpose, which simply converts a row vector to a column vector, or vice versa, using the apostrophe notation:

$$\underset{k\times 1}{\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_k \end{bmatrix}}' = \underset{1\times k}{[u_1, u_2, \ldots, u_k]}, \qquad \underset{1\times k}{[u_1, u_2, \ldots, u_k]}' = \underset{k\times 1}{\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_k \end{bmatrix}}.$$

This is essentially book-keeping with vectors and we will not worry about it extensively in this text, but as we will see shortly it is important with matrix operations. Also, note that the order of multiplication now matters.
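The role of adjacent sizes can be made concrete with a small Python sketch in which row vectors are stored as 1 × k nested lists and column vectors as k × 1 nested lists. The representation, the generic `multiply` helper, and the example vectors (borrowed from Example 3.6) are assumptions made for illustration.

```python
def transpose(m):
    """Swap rows and columns, turning a 1 x k row into a k x 1 column and back."""
    return [list(column) for column in zip(*m)]


def multiply(a, b):
    """Multiply an (r x k) array by a (k x c) array; the adjacent sizes must match."""
    if len(a[0]) != len(b):
        raise ValueError("nonconformable: %dx%d times %dx%d"
                         % (len(a), len(a[0]), len(b), len(b[0])))
    return [[sum(a[i][n] * b[n][j] for n in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


u_row = [[3, 3, 3]]   # 1 x 3 row vector
v_row = [[1, 2, 3]]   # 1 x 3 row vector

inner = multiply(u_row, transpose(v_row))   # (1 x 3)(3 x 1) gives a 1 x 1 result
outer = multiply(transpose(u_row), v_row)   # (3 x 1)(1 x 3) gives a 3 x 3 result
print(inner)
print(outer)
```

Swapping which vector is transposed changes the result from a single number to a 3 × 3 array, which is one way to see why the order of multiplication now matters.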

3.2.1 Vector Norms

Measuring the "length" of vectors is a surprisingly nuanced topic. This is because there are different ways to consider Cartesian length in the dimension implied by the size (number of elements) of the vector. It is obvious, for instance, that (5, 5, 5) should be considered longer than (1, 1, 1), but it is not clear whether (4, 4, 4) is longer than (3, −6, 3). The standard version of the vector norm for an n-length vector is given by

$$\|\mathbf{v}\| = (v_1^2 + v_2^2 + \cdots + v_n^2)^{\frac{1}{2}} = (\mathbf{v}\cdot\mathbf{v})^{\frac{1}{2}}.$$

In this way, the vector norm can be thought of as the distance of the vector from the origin. Using the formula for $\|\mathbf{v}\|$ we can now calculate the vector norm for (4, 4, 4) and (3, −6, 3):

$$\|(4, 4, 4)\| = \sqrt{4^2 + 4^2 + 4^2} = 6.928203$$
$$\|(3, -6, 3)\| = \sqrt{3^2 + (-6)^2 + 3^2} = 7.348469.$$

So the second vector is actually longer by this measure. Consider the following properties of the vector norm (notice the reoccurrence of the dot product in the Multiplication Form):
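The two norm calculations can be reproduced with a brief Python sketch built directly on the dot product. The checks of the scaling rule and the triangle inequality at the end are standard facts about norms included for illustration, not a reproduction of the property list in this section; the function names are assumptions.

```python
import math


def dot(u, v):
    """Inner product of two conformable vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def norm(v):
    """Standard vector norm: the square root of v . v."""
    return math.sqrt(dot(v, v))


a = [4, 4, 4]
b = [3, -6, 3]
print(round(norm(a), 6), round(norm(b), 6))
print(norm(b) > norm(a))

# Two standard properties, checked on the same vectors.
scaled = [-2 * x for x in a]
print(abs(norm(scaled) - 2 * norm(a)) < 1e-12)        # ||s v|| = |s| ||v||
total = [x + y for x, y in zip(a, b)]
print(norm(total) <= norm(a) + norm(b))               # triangle inequality
```

Because the norm is defined through the dot product, any routine that computes one can be reused for the other.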

Interestingly, norming can also be applied to find the n-dimensional distance between the endpoints of two vectors starting at the origin with a variant of the Pythagorean Theorem known as the law of cosines:

$$\|\mathbf{v} - \mathbf{w}\|^2 = \|\mathbf{v}\|^2 + \|\mathbf{w}\|^2 - 2\|\mathbf{v}\|\|\mathbf{w}\|\cos\theta,$$

where θ is the angle from the w vector to the v vector measured in radians. This is also called the cosine rule and leads to the property that

$$\cos(\theta) = \frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|}.$$

Example 3.12: Votes in the House of Commons. Casstevens (1970) looked at legislative cohesion in the British House of Commons. Prime Minister David Lloyd George claimed on April 9, 1918 that the French Army was stronger on January 1, 1918 than on January 1, 1917 (a statement that generated considerable controversy). Subsequently the leader of the Liberal Party moved that a select committee be appointed to investigate claims by the military that Lloyd George was incorrect. The resulting motion was defeated by the following vote: Liberal Party 98 yes, 71 no; Labour Party 9 yes, 15 no; Conservative Party 1 yes, 206 no; others 0 yes, 3 no. The difficulty in analyzing this vote is the fact that 267 Members of Parliament (MPs) did not vote. So do we include them in the denominator when making claims about voting patterns? Casstevens says no because large numbers of abstentions mean that such indicators are misleading. He alternatively looked at party cohesion for the two large parties as vector norms:

$$\|L\| = \|(98, 71)\| = 121.0165$$
$$\|C\| = \|(1, 206)\| = 206.0024.$$

From this we get the obvious conclusion that the Conservatives are more cohesive because their vector has greater magnitude. More interestingly, we can contrast the two parties by calculating the angle between these two vectors (in radians) using the cosine rule:

$$\theta = \arccos\left(\frac{(98, 71)\cdot(1, 206)}{121.0165 \times 206.0024}\right) = 0.9389,$$

which is about 54 degrees. Recall that arccos is the inverse function to cos. It is hard to say exactly how dramatic this angle is, but if we were analyzing a series of votes in a legislative body, this type of summary statistic would facilitate comparisons.
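The House of Commons calculation can be retraced in Python using only the vote counts given in the example. This is a minimal sketch with illustrative function and variable names.

```python
import math


def dot(u, v):
    """Inner product of two conformable vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def norm(v):
    """Standard vector norm."""
    return math.sqrt(dot(v, v))


def angle_between(u, v):
    """Angle in radians between u and v from the cosine rule."""
    return math.acos(dot(u, v) / (norm(u) * norm(v)))


liberal = (98, 71)         # (yes, no) votes for the Liberal Party
conservative = (1, 206)    # (yes, no) votes for the Conservative Party

theta = angle_between(liberal, conservative)
print(round(norm(liberal), 4), round(norm(conservative), 4))
print(theta, math.degrees(theta))
```

The same three functions could be applied to each vote in a series of divisions to produce the kind of comparable summary statistic suggested at the end of the example.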

that is, just the maximum vector value. Whenever a vector has a p-norm of 1, it is called a unit vector. In general, if p is left off the norm, then one can safely assume that it is the p = 2 form discussed above. Vector p-norms have the following properties:

known to be an effective policy tool for democratic governments, it is also a very difficult political solution for many politicians because it can be unpopular and controversial. Swank and Steinmo (2002) looked at factors that lead to changes in tax policies in "advanced capitalist" democracies with the idea that factors like internationalization of economies, political pressure from budgets, and within-country economic factors are influential. They found that governments have a number of constraints on their ability to enact significant changes in tax rates, even when there is pressure to increase economic efficiency. As part of this study the authors provided a total taxation from labor and consumption as a percentage of GDP in the form of two vectors: one for 1981 and another for 1995. These are reproduced as

Nation             1981    1995
Australia            30      31
Austria              44      42
Belgium              45      46
Canada               35      37
Denmark              45      51
Finland              38      46
France               42      44
Germany              38      39
Ireland              33      34
Italy                31      41
Japan                26      29
Netherlands          44      44
New Zealand          34      38
Norway               49      46
Sweden               50      50
Switzerland          31      34
United Kingdom       36      36
United States        29      28

A natural question to ask is, how much have taxation rates changed over the 14-year period for these countries collectively? The difference in mean averages, 38 versus 40, is not terribly revealing because it "washes out" important differences since some countries increased and others decreased. That is, what does a 5% difference in average change in total taxation over GDP say about how these countries changed as a group when some countries changed very little and some made considerable changes? Furthermore, when changes go in opposite directions it lowers the overall sense of an effect. In other words, summaries like averages are not good measures when we want some sense of net change. One way of assessing total country change is employing the difference norm to compare aggregate vector difference.

So what does this mean? For comparison, we can calculate the same vector norm except that instead of using $t_{1995}$, we will substitute a vector that increases the 1981 levels uniformly by 10% (a hypothetical increase of 10% for every country in the study):

$$\hat{t}_{1981} = 1.1\,t_{1981} = [33.0, 48.4, 49.5, 38.5, 49.5, 41.8, 46.2, 41.8, 36.3, 34.1, 28.6, 48.4, 37.4, 53.9, 55.0, 34.1, 39.6, 31.9].$$

This allows us to calculate the following benchmark difference:

$$\|\hat{t}_{1981} - t_{1981}\|^2 = 265.8.$$

So now it is clear that the observed vector difference for total country change from 1981 to 1995 is actually similar to a 10% across-the-board change rather than a 5% change implied by the vector means. In this sense we get a true multidimensional sense of change.
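The comparison can be retraced in Python from the tabled values. The sketch assumes that the difference norm in question is the squared Euclidean norm of the element-by-element difference, an assumption that reproduces the 265.8 benchmark quoted in the text; the variable and function names are illustrative.

```python
# Total taxation as a percentage of GDP, in the country order of the table.
T1981 = [30, 44, 45, 35, 45, 38, 42, 38, 33, 31, 26, 44, 34, 49, 50, 31, 36, 29]
T1995 = [31, 42, 46, 37, 51, 46, 44, 39, 34, 41, 29, 44, 38, 46, 50, 34, 36, 28]


def squared_difference_norm(a, b):
    """Squared Euclidean norm of a - b (ASSUMED to be the norm used in the text)."""
    if len(a) != len(b):
        raise ValueError("nonconformable vectors")
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


observed = squared_difference_norm(T1995, T1981)
hypothetical = [1.1 * t for t in T1981]          # uniform 10% increase
benchmark = squared_difference_norm(hypothetical, T1981)

print(round(mean(T1981)), round(mean(T1995)))
print(observed, round(benchmark, 1))
```

Under this assumption the observed value works out to 260, close to the 265.8 benchmark, which is consistent with the conclusion that the observed change resembles a 10% across-the-board change.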

3.3 So What Is the Matrix?

Matrices are all around us: A matrix is nothing more than a rectangular arrangement of numbers. It is a way to individually assign numbers, now called matrix elements or entries, to specified positions in a single structure, referred to with a single name. Just as we saw that the order in which individual entries appear in the vector matters, the ordering of values within both rows and columns now matters. It turns out that this requirement adds a considerable amount of structure to the matrix, some of which is not immediately apparent (as we will see). Matrices have two definable dimensions, the number of rows and the number of columns, whereas vectors only have one, and we denote matrix size by row × column. Thus a matrix with i rows and j columns is said to be of dimension i × j (by convention rows come before columns). For instance, a simple (and rather uncreative) 2 × 2 matrix named X (like vectors, matrix names are bolded) is given by:

$$\underset{2\times 2}{\mathbf{X}} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}.$$

Note that matrices can also be explicitly notated with size. Two things are important here. First, these four numbers are now treated together as a single unit: They are grouped together in the two-row by two-column matrix object. Second, the positioning of the numbers is specified.

as well as many others. Like vectors, the elements of a matrix can be integers, real numbers, or complex numbers. It is, however, rare to find applications that call for the use of matrices of complex numbers in the social sciences.

The matrix is a system. We can refer directly to the specific elements of a matrix by using subscripting of addresses. So, for instance, the elements of X are given by $x_{11} = 1$, $x_{12} = 2$, $x_{21} = 3$, and $x_{22} = 4$. Obviously this is much more powerful for larger matrix objects and we can even talk about arbitrary sizes. The element addresses of an n × p matrix can be described for large values of n and p by

$$\mathbf{X} = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1(p-1)} & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2(p-1)} & x_{2p} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
x_{(n-1)1} & x_{(n-1)2} & \cdots & x_{(n-1)(p-1)} & x_{(n-1)p} \\
x_{n1} & x_{n2} & \cdots & x_{n(p-1)} & x_{np}
\end{bmatrix}.$$

Using this notation we can now define matrix equality. Matrix A is equal to matrix B if and only if every element of A is equal to the corresponding element of B: $\mathbf{A} = \mathbf{B} \iff a_{ij} = b_{ij}\ \forall i, j$. Note that “subsumed” in this definition is the requirement that the two matrices be of the same dimension (same number of rows, i, and columns, j).
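Element addressing and matrix equality translate directly into code. The following is a minimal sketch that stores a matrix as a list of rows. This representation and the helper names `element`, `dimension`, and `matrices_equal` are illustrative assumptions rather than notation from the text.

```python
# The 2 x 2 example matrix X, stored as a list of rows.
X = [[1, 2],
     [3, 4]]


def element(matrix, i, j):
    """Return x_ij using the 1-based row/column addresses used in the text."""
    return matrix[i - 1][j - 1]


def dimension(matrix):
    """Return the size of the matrix as (rows, columns)."""
    return len(matrix), len(matrix[0])


def matrices_equal(a, b):
    """A = B iff the dimensions match and a_ij = b_ij for all i, j."""
    if dimension(a) != dimension(b):
        return False
    return all(a_ij == b_ij
               for row_a, row_b in zip(a, b)
               for a_ij, b_ij in zip(row_a, row_b))


print(element(X, 2, 1))                           # x_21 = 3
print(matrices_equal(X, [[1, 2], [3, 4]]))        # True
print(matrices_equal(X, [[1, 2, 0], [3, 4, 0]]))  # False: dimensions differ
```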

3.3.1 Some Special Matrices

There are some matrices that are quite routinely used in quantitative social science work. The most basic of these is the square matrix, which is, as the name implies, a matrix with the same number of rows and columns. Because one number identifies the complete size of the square matrix, we can say that a k × k matrix (for arbitrary size k) is a matrix of order-k. Square matrices can contain any values and remain square: The square property is independent of the contents. A very general square matrix form is the symmetric matrix. This is a matrix that is symmetric across the diagonal from the upper left-hand corner to the lower right-hand corner. More formally, X is a symmetric matrix iff $x_{ij} = x_{ji}\ \forall i, j$. Here is an unimaginative example of a symmetric matrix:

$$\mathbf{X} = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 8 & 5 & 6 \\ 3 & 5 & 1 & 7 \\ 4 & 6 & 7 & 8 \end{bmatrix}.$$

A matrix can also be skew-symmetric if it has the property that the rows and column switching operation would provide the same matrix except for the sign. For example,

$$\mathbf{X} = \begin{bmatrix} 0 & -1 & 2 \\ 1 & 0 & -3 \\ -2 & 3 & 0 \end{bmatrix}.$$

By the way, the symmetric property does not hold for the other diagonal, the one from the upper right-hand side to the lower left-hand side. Just as the symmetric matrix is a special case of the square matrix, the diagonal matrix is a special case of the symmetric matrix (and therefore of the square matrix, too). A diagonal matrix is a symmetric matrix with all zeros on the off-diagonals (the values where i ≠ j). If the (4 × 4) X matrix above were a diagonal matrix, it would look like

$$\mathbf{X} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 8 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 8 \end{bmatrix}.$$
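The definitions of symmetric, skew-symmetric, and diagonal matrices can be stated as simple element-wise checks. The sketch below uses the example matrices from this section; the function names are hypothetical helpers chosen for illustration.

```python
def transpose(m):
    """Switch rows and columns."""
    return [list(column) for column in zip(*m)]


def is_square(m):
    return all(len(row) == len(m) for row in m)


def is_symmetric(m):
    """x_ij == x_ji for all i, j."""
    return is_square(m) and m == transpose(m)


def is_skew_symmetric(m):
    """x_ij == -x_ji for all i, j (so the diagonal must be zero)."""
    negated_transpose = [[-value for value in row] for row in transpose(m)]
    return is_square(m) and m == negated_transpose


def is_diagonal(m):
    """Every off-diagonal element (i != j) is zero."""
    k = len(m)
    return is_square(m) and all(m[i][j] == 0
                                for i in range(k) for j in range(k) if i != j)


S = [[1, 2, 3, 4], [2, 8, 5, 6], [3, 5, 1, 7], [4, 6, 7, 8]]   # symmetric example
K = [[0, -1, 2], [1, 0, -3], [-2, 3, 0]]                       # skew-symmetric example
# Keep only the diagonal of S to obtain the diagonal version.
D = [[S[i][j] if i == j else 0 for j in range(4)] for i in range(4)]

print(is_symmetric(S), is_skew_symmetric(K), is_diagonal(D))
```

Note that every diagonal matrix passes the symmetric check as well, which mirrors the nesting of special cases described above.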

A diagonal matrix can have any values on the diagonal, but all of the other values must be zero. A very important special case of the diagonal matrix is the identity matrix, which has only the value 1 for each diagonal element: $d_i = 1,\ \forall i$. A 4 × 4 version is

$$\mathbf{I} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$

This matrix form is always given the name I, and it is sometimes denoted to give size: $I_{4\times 4}$ or even just I(4). A seemingly similar, but actually very different, matrix is the J matrix, which consists of all 1's:

$$\mathbf{J} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix},$$

given here in a 4 × 4 version. As we shall soon see, the identity matrix is very commonly used because it is the matrix equivalent of the scalar number 1, whereas the J matrix is not (somewhat surprisingly). Analogously, the zero

where LT designates “lower triangular” and UT designates “upper triangular.” This general form plays a special role in matrix decomposition: factoring matrices into multiplied components. This is also a common form in more pedestrian circumstances. Map books often tabulate distances between sets of cities in an upper triangular or lower triangular form because the distance from Miami to New York is also the distance from New York to Miami.

Example 3.15: Marriage Satisfaction. Sociologists who study marriage often focus on indicators of self-expressed satisfaction. Unfortunately marital satisfaction is sufficiently complex and sufficiently multidimensional that single measurements are often insufficient to get a full picture of underlying attitudes. Consequently, scholars such as Norton (1983) ask multiple questions designed to elicit varied expressions of marital satisfaction and therefore care a lot about the correlation between these. A correlation (described in detail in Chapter 8) shows how “tightly” two measures change with each other over a range from −1 to 1, with 0 being no evidence of moving together. His correlation matrix provides the correlational structure between answers to the following questions according to scales where higher numbers mean that the respondent agrees more (i.e., 1 is strong disagreement with the statement and 7 is strong agreement with the statement). The questions are

Question                                            Measurement Scale   Valid Cases
We have a good marriage                             7-point             428
My relationship with my partner is very stable      7-point             429
Our marriage is strong                              7-point             429
My relationship with my partner makes me happy      7-point             429
I really feel like part of a team with my partner   7-point             426
The degree of happiness, everything considered      10-point            407

Since the correlation between two variables is symmetric, it does not make sense to give a correlation matrix between these variables across a full matrix because the lower triangle will simply mirror the upper triangle and make the display more congested. Consequently, Norton only needs to show a triangular version of the matrix:

Interestingly, these analyzed questions all correlate highly (a 1 means a perfectly positive relationship). The question that seems to covary greatly with the others is the first (it is phrased somewhat as a summary, after all). Notice that strength of marriage and part of a team covary less than any others (a suggestive finding). This presentation is a bit different from an upper triangular matrix in the sense discussed above because we have just deliberately omitted redundant information, rather than the rest of the matrix actually having zero values.

3.4 Controlling the Matrix

As with vectors we can perform arithmetic and algebraic operations on matrices. In particular addition, subtraction, and scalar operations are quite simple. Matrix addition and subtraction are performed only for two conformable matrices by performing the operation on an element-by-element basis for corresponding elements, so the number of rows and columns must match. Multiplication or division by a scalar proceeds exactly in the way that it did for vectors by affecting each element by the operation.

One special case is worth mentioning. A common implied scalar multiplication is the negative of a matrix, −X. This is a shorthand means for saying that every matrix element in X is multiplied by −1.
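These element-by-element operations can be sketched in a few lines. The code below is a minimal illustration with hypothetical helper names (`add`, `subtract`, `scale`, `negate`); the second matrix `Y` is made up for the example.

```python
def same_size(a, b):
    """Conformable for addition/subtraction: same number of rows and columns."""
    return len(a) == len(b) and all(len(r) == len(s) for r, s in zip(a, b))


def add(a, b):
    if not same_size(a, b):
        raise ValueError("matrices are not conformable for addition")
    return [[x + y for x, y in zip(r, s)] for r, s in zip(a, b)]


def subtract(a, b):
    if not same_size(a, b):
        raise ValueError("matrices are not conformable for subtraction")
    return [[x - y for x, y in zip(r, s)] for r, s in zip(a, b)]


def scale(s, a):
    """Scalar multiplication: every element is multiplied by s."""
    return [[s * x for x in row] for row in a]


def negate(a):
    """-X is shorthand for multiplying every element of X by -1."""
    return scale(-1, a)


X = [[1, 2], [3, 4]]
Y = [[-2, 2], [0, 1]]   # illustrative values, not from the text

print(add(X, Y))        # [[-1, 4], [3, 5]]
print(subtract(X, Y))   # [[3, 0], [3, 3]]
print(scale(3, X))      # [[3, 6], [9, 12]]
print(negate(X))        # [[-1, -2], [-3, -4]]
```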


These are the most basic matrix operations and obviously consist of nothing more than being careful about performing each individual elemental operation. As with vectors, we can summarize the arithmetic properties as follows.

Matrix multiplication is necessarily more complicated than these simple operations. The first issue is conformability. Two matrices are conformable for multiplication if the number of columns in the first matrix matches the number of rows in the second matrix. Note that this implies that the order of multiplication matters with matrices. This is the first algebraic principle that deviates from the simple scalar world that we all learned early on in life. To be specific, suppose that X is size k × n and Y is size n × p. Then the multiplication operation given by

$$\underset{(k\times n)}{\mathbf{X}}\ \underset{(n\times p)}{\mathbf{Y}}$$

is valid because the inner numbers match up, but the multiplication operation given by

$$\underset{(n\times p)}{\mathbf{Y}}\ \underset{(k\times n)}{\mathbf{X}}$$

is not unless p = k. Furthermore, the inner dimension numbers of the operation determine conformability and the outer dimension numbers determine the size of the resulting matrix. So in the example of XY above, the resulting matrix would be of size k × p. To maintain awareness of this order of operation distinction, we say that X pre-multiplies Y or, equivalently, that Y post-multiplies X.

So how is matrix multiplication done? In an attempt to be somewhat intuitive, we can think about the operation in vector terms. For $X_{k\times n}$ and $Y_{n\times p}$, we take each of the k row vectors in X (each of length n) and perform a vector inner product with each of the p column vectors in Y (also of length n). This operation starts with performing the inner product of the first row in X with the first column in Y and the result will be the first element of the product matrix. Consider a simple case of two arbitrary 2 × 2 matrices:

$$\mathbf{XY} = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{bmatrix}
\begin{bmatrix} y_{11} & y_{12} \\ y_{21} & y_{22} \end{bmatrix}
= \begin{bmatrix} (x_{11}\ x_{12})\cdot(y_{11}\ y_{21}) & (x_{11}\ x_{12})\cdot(y_{12}\ y_{22}) \\ (x_{21}\ x_{22})\cdot(y_{11}\ y_{21}) & (x_{21}\ x_{22})\cdot(y_{12}\ y_{22}) \end{bmatrix}
= \begin{bmatrix} x_{11}y_{11} + x_{12}y_{21} & x_{11}y_{12} + x_{12}y_{22} \\ x_{21}y_{11} + x_{22}y_{21} & x_{21}y_{12} + x_{22}y_{22} \end{bmatrix}.$$
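The row-by-column procedure can be written as a short function. This is a minimal sketch, not an efficient implementation: `matmul` is a hypothetical helper that first checks conformability (inner dimensions) and then fills each entry with an inner product. The example matrices are made up for illustration.

```python
def matmul(x, y):
    """Multiply a k x n matrix by an n x p matrix, returning a k x p matrix."""
    k, n = len(x), len(x[0])
    if len(y) != n:
        raise ValueError("not conformable: columns of X must equal rows of Y")
    p = len(y[0])
    # Entry [r][c] is the inner product of row r of X with column c of Y.
    return [[sum(x[r][i] * y[i][c] for i in range(n)) for c in range(p)]
            for r in range(k)]


X = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
Y = [[1, 0],
     [0, 1],
     [1, 1]]             # 3 x 2

XY = matmul(X, Y)        # (2 x 3)(3 x 2) -> 2 x 2
YX = matmul(Y, X)        # (3 x 2)(2 x 3) -> 3 x 3
print(XY)                # [[4, 5], [10, 11]]
print(len(YX), len(YX[0]))

try:
    matmul(X, X)         # (2 x 3)(2 x 3): inner dimensions 3 and 2 do not match
except ValueError as error:
    print("error:", error)
```

Here both orders happen to be conformable because p = k, yet the two products do not even have the same size.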

Perhaps we can make this more intuitive visually. Suppose that we notate the four values of the ﬁnal matrix as XY[1, 1], XY[1, 2], XY[2, 1], XY[2, 2] corresponding to their position in the 2 × 2 product. Then we can visualize how the rows of the ﬁrst matrix operate against the columns of the second matrix to produce each value:

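While it helps to visualize the process in this way, we can also express the product in a more general, but perhaps intimidating, scalar notation for an arbitrary-sized operation:

$$\underset{(k\times n)}{\mathbf{X}}\ \underset{(n\times p)}{\mathbf{Y}} =
\begin{bmatrix}
\sum_{i=1}^{n} x_{1i}y_{i1} & \sum_{i=1}^{n} x_{1i}y_{i2} & \cdots & \sum_{i=1}^{n} x_{1i}y_{ip} \\
\sum_{i=1}^{n} x_{2i}y_{i1} & \sum_{i=1}^{n} x_{2i}y_{i2} & \cdots & \sum_{i=1}^{n} x_{2i}y_{ip} \\
\vdots & \vdots & \ddots & \vdots \\
\sum_{i=1}^{n} x_{ki}y_{i1} & \sum_{i=1}^{n} x_{ki}y_{i2} & \cdots & \sum_{i=1}^{n} x_{ki}y_{ip}
\end{bmatrix}.$$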

be decomposed as the product of lower and upper triangular matrices. This is a very general finding that we will return to and extend in the next chapter. The principle works like this for the matrix A:

$$\underset{(p\times p)}{\mathbf{A}} = \underset{(p\times p)}{\mathbf{L}}\ \underset{(p\times p)}{\mathbf{U}},$$

where L is a lower triangular matrix and U is an upper triangular matrix (sometimes a permutation matrix is also required; see the explanation of permutation matrices below). Consider the following example matrix decomposition according to this scheme:

$$\begin{bmatrix} 2 & 3 & 3 \\ 1 & 2 & 9 \\ 1 & 1 & 12 \end{bmatrix}
= \begin{bmatrix} 1.0 & 0 & 0 \\ 0.5 & 1 & 0 \\ 0.5 & -1 & 1 \end{bmatrix}
\begin{bmatrix} 2 & 3.0 & 3.0 \\ 0 & 0.5 & 7.5 \\ 0 & 0.0 & 18.0 \end{bmatrix}.$$
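The LU example can be verified by multiplying the factors back together, and the triangular structure can then be used to solve a system of equations by substitution. The sketch below assumes the factors shown in the example; the right-hand side `b` is a made-up vector chosen so that the solution is a vector of ones, and the helper names are hypothetical.

```python
def matmul(x, y):
    """Row-by-column matrix product (assumes conformable inputs)."""
    return [[sum(x[r][i] * y[i][c] for i in range(len(y)))
             for c in range(len(y[0]))] for r in range(len(x))]


def forward_substitution(lower, b):
    """Solve L y = b from the top row down."""
    y = []
    for i, row in enumerate(lower):
        y.append((b[i] - sum(row[j] * y[j] for j in range(i))) / row[i])
    return y


def back_substitution(upper, y):
    """Solve U x = y from the bottom row up."""
    n = len(upper)
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(upper[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - tail) / upper[i][i]
    return x


A = [[2, 3, 3], [1, 2, 9], [1, 1, 12]]
L = [[1.0, 0, 0], [0.5, 1, 0], [0.5, -1, 1]]
U = [[2, 3.0, 3.0], [0, 0.5, 7.5], [0, 0.0, 18.0]]

print(matmul(L, U) == A)          # the factors reproduce A

b = [8, 12, 14]                   # illustrative right-hand side for A x = b
x = back_substitution(U, forward_substitution(L, b))
print(x)                          # [1.0, 1.0, 1.0]
```

Once L and U are available, each new right-hand side costs only two substitution passes, which is the sense in which the mechanical work has already been done.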

This decomposition is very useful for solving systems of equations because much of the mechanical work is already done by the triangularization.

Now that we have seen how matrix multiplication is performed, we can return to the principle that pre-multiplication is different from post-multiplication. In the case discussed we could perform one of these operations but not the other, so the difference was obvious. What about multiplying two square matrices? Both orders of multiplication are possible, but it turns out that except for special cases the result will differ. In fact, we need only provide one particular case to prove this point. Consider the matrices X and Y:

$$\mathbf{XY} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 4 & 3 \end{bmatrix}$$

$$\mathbf{YX} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 3 & 4 \\ 1 & 2 \end{bmatrix}.$$
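The counterexample is easy to reproduce. The following sketch multiplies the two matrices in both orders using a small hypothetical `matmul` helper and confirms that the products differ.

```python
def matmul(x, y):
    """Row-by-column matrix product (assumes conformable inputs)."""
    return [[sum(x[r][i] * y[i][c] for i in range(len(y)))
             for c in range(len(y[0]))] for r in range(len(x))]


X = [[1, 2], [3, 4]]
Y = [[0, 1], [1, 0]]

XY = matmul(X, Y)   # post-multiplying X by Y swaps the columns of X
YX = matmul(Y, X)   # pre-multiplying X by Y swaps the rows of X

print(XY)           # [[2, 1], [4, 3]]
print(YX)           # [[3, 4], [1, 2]]
print(XY == YX)     # False: matrix multiplication is not commutative
```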

This is a very simple example, but the implications are obvious. Even in cases where pre-multiplication and post-multiplication are possible, these are different operations and matrix multiplication is not commutative. Recall also the claim that the identity matrix I is operationally equivalent to 1 in matrix terms rather than the seemingly more obvious J matrix. Let us now test this claim on a simple matrix, first with I:

$$\mathbf{XI} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
= \begin{bmatrix} (1)(1) + (2)(0) & (1)(0) + (2)(1) \\ (3)(1) + (4)(0) & (3)(0) + (4)(1) \end{bmatrix}
= \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix},$$

and then with J:

$$\mathbf{XJ}_2 = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}
= \begin{bmatrix} (1)(1) + (2)(1) & (1)(1) + (2)(1) \\ (3)(1) + (4)(1) & (3)(1) + (4)(1) \end{bmatrix}
= \begin{bmatrix} 3 & 3 \\ 7 & 7 \end{bmatrix}.$$

The result here is interesting; post-multiplying by I returns the X matrix to its original form, but post-multiplying by J produces a matrix where values are the sum by row. What about pre-multiplication? Pre-multiplying by I also returns

which is now the sum down columns assigned as row values. This means that the J matrix can be very useful in calculations (including linear regression methods), but it does not work as a “one” in matrix terms. There is also a very interesting multiplicative property of the J matrix, particularly for nonsquare forms:

$$\underset{(p\times n)}{\mathbf{J}}\ \underset{(n\times k)}{\mathbf{J}} = n \underset{(p\times k)}{\mathbf{J}}.$$
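A short numerical check makes the contrast between I and J concrete. This sketch uses hypothetical helpers `identity` and `ones` to build the two matrices, and picks the sizes p = 2, n = 3, k = 4 arbitrarily for the nonsquare J property.

```python
def matmul(x, y):
    """Row-by-column matrix product (assumes conformable inputs)."""
    return [[sum(x[r][i] * y[i][c] for i in range(len(y)))
             for c in range(len(y[0]))] for r in range(len(x))]


def identity(k):
    return [[1 if i == j else 0 for j in range(k)] for i in range(k)]


def ones(rows, cols):
    """The J matrix: every element equals 1."""
    return [[1] * cols for _ in range(rows)]


X = [[1, 2], [3, 4]]
I2, J2 = identity(2), ones(2, 2)

print(matmul(X, I2), matmul(I2, X))  # both return X unchanged
print(matmul(X, J2))                 # row sums: [[3, 3], [7, 7]]
print(matmul(J2, X))                 # column sums: [[4, 6], [4, 6]]

# J (p x n) times J (n x k) equals n times J (p x k); sizes chosen arbitrarily.
p, n, k = 2, 3, 4
product = matmul(ones(p, n), ones(n, k))
print(product == [[n] * k for _ in range(p)])
```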

which shows the switching of rows two and three as well as the conﬁnement of multiplication by 3 to the ﬁrst row.

3.5 Matrix Transposition

Another operation that is commonly performed on a single matrix is transposition. We saw this before in the context of vectors: switching between column and row forms. For matrices, this is slightly more involved but straightforward to understand: simply switch rows and columns. The transpose of an i × j

In this way the inner structure of the matrix is preserved but the shape of the matrix is changed. An interesting consequence is that transposition allows us to calculate the “square” of some arbitrary-sized i × j matrix: X′X is always conformable, as is XX′, even if i ≠ j. We can also be more precise about the definition of symmetric and skew-symmetric matrices. Consider now some basic properties of transposition.
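To see why both “squares” are always conformable, consider a small nonsquare case. The sketch below transposes a made-up 2 × 3 matrix and forms X′X (3 × 3) and XX′ (2 × 2); the helper names are illustrative assumptions.

```python
def transpose(m):
    """Switch rows and columns: element [i][j] moves to [j][i]."""
    return [list(column) for column in zip(*m)]


def matmul(x, y):
    """Row-by-column matrix product (assumes conformable inputs)."""
    return [[sum(x[r][i] * y[i][c] for i in range(len(y)))
             for c in range(len(y[0]))] for r in range(len(x))]


X = [[1, 2, 3],
     [4, 5, 6]]                 # 2 x 3, so i != j
Xt = transpose(X)               # 3 x 2

XtX = matmul(Xt, X)             # (3 x 2)(2 x 3) -> 3 x 3
XXt = matmul(X, Xt)             # (2 x 3)(3 x 2) -> 2 x 2

print(Xt)                       # [[1, 4], [2, 5], [3, 6]]
print(XtX)                      # [[17, 22, 27], [22, 29, 36], [27, 36, 45]]
print(XXt)                      # [[14, 32], [32, 77]]
# Both products are symmetric: each equals its own transpose.
print(XtX == transpose(XtX), XXt == transpose(XXt))
```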

3.6 Advanced Topics

This section contains a set of topics that are less frequently used in the social sciences but may appear in some literatures. Readers may elect to skip this section or use it for reference only.

Another particularistic matrix is an involutory matrix, which has the property that when squared it produces an identity matrix. For example,

$$\begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix}^2 = \mathbf{I},$$

although more creative forms exist.

3.6.2 Vectorization of Matrices

Occasionally it is convenient to rearrange a matrix into vector form. The most common way to do this is to “stack” vectors from the matrix on top of each other, beginning with the first column vector of the matrix, to form one long column vector. Specifically, to vectorize an i × j matrix X, we consecutively stack the j column vectors (each of length i) to obtain a single vector of length ij. This is denoted vec(X) and has some obvious properties, such as s vec(X) = vec(sX) for some scalar s and vec(X + Y) = vec(X) + vec(Y) for matrices conformable by addition. Returning to our simple example,

$$\mathrm{vec}\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 1 \\ 3 \\ 2 \\ 4 \end{bmatrix}.$$
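Column stacking is a one-line operation in code. The sketch below defines a hypothetical `vec` helper, applies it to the running example, and checks the two simple properties mentioned above; the second matrix `Y` and the scalar are made up for the check.

```python
def vec(m):
    """Stack the columns of m, first column on top, into one long list."""
    rows, cols = len(m), len(m[0])
    return [m[r][c] for c in range(cols) for r in range(rows)]


def transpose(m):
    return [list(column) for column in zip(*m)]


X = [[1, 2], [3, 4]]
Y = [[5, 6], [7, 8]]            # illustrative values
s = 3                           # illustrative scalar

print(vec(X))                   # [1, 3, 2, 4]
print(vec(transpose(X)))        # [1, 2, 3, 4]: stacks the rows of X instead

# s vec(X) = vec(sX)
print([s * v for v in vec(X)] == vec([[s * x for x in row] for row in X]))
# vec(X + Y) = vec(X) + vec(Y)
X_plus_Y = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(X, Y)]
print(vec(X_plus_Y) == [a + b for a, b in zip(vec(X), vec(Y))])
```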

Interestingly, it is not true that vec(X) = vec(X′) since the latter would stack rows instead of columns. And vectorization of products is considerably more involved (see the next section). A final, and sometimes important, type of matrix multiplication is the Kronecker product (also called the tensor product), which comes up naturally in the statistical analyses of time series data (data recorded on the same measures of interest at different points in time). This is a slightly more abstract

The vectorize function above has a product that involves the Kronecker function. For i × j matrix X and j × k matrix Y, we get vec(XY) = (I ⊗ X)vec(Y), where I is an identity matrix of order k (so that I ⊗ X has jk columns, matching the length of vec(Y)). For three matrices this is only slightly more complex: vec(XYZ) = (Z′ ⊗ X)vec(Y), for k × ℓ matrix Z. Kronecker products have some other interesting properties as well (matrix inversion is discussed in the next chapter):

Here the notation tr() denotes the “trace,” which is just the sum of the diagonal values going from the uppermost left value to the lowermost right value, for square matrices. Thus the trace of an identity matrix would be just its order. This is where we will pick up next in Chapter 4.
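The vec and Kronecker identities, along with the trace of an identity matrix, can be checked numerically on small cases. The sketch below is an illustration under stated assumptions: the helper names (`kronecker`, `vec`, `matvec`, `trace`) are hypothetical, the test matrices are made up, and the identities are used in the forms vec(XY) = (I ⊗ X)vec(Y) with I of order k, and vec(XYZ) = (Z′ ⊗ X)vec(Y).

```python
def matmul(x, y):
    return [[sum(x[r][i] * y[i][c] for i in range(len(y)))
             for c in range(len(y[0]))] for r in range(len(x))]


def transpose(m):
    return [list(column) for column in zip(*m)]


def identity(k):
    return [[1 if i == j else 0 for j in range(k)] for i in range(k)]


def vec(m):
    """Stack the columns of m into one long list."""
    return [m[r][c] for c in range(len(m[0])) for r in range(len(m))]


def kronecker(a, b):
    """Kronecker product: each element a_ij is replaced by the block a_ij * B."""
    return [[a_ij * b_kl for a_ij in row_a for b_kl in row_b]
            for row_a in a for row_b in b]


def matvec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def trace(m):
    """Sum of the diagonal values of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))


X = [[1, 2], [3, 4], [5, 6]]        # i x j = 3 x 2
Y = [[1, 0, 2], [0, 1, 1]]          # j x k = 2 x 3
Z = [[1, 2], [0, 1], [1, 0]]        # k x l = 3 x 2

lhs_two = vec(matmul(X, Y))
rhs_two = matvec(kronecker(identity(3), X), vec(Y))     # identity of order k = 3
lhs_three = vec(matmul(matmul(X, Y), Z))
rhs_three = matvec(kronecker(transpose(Z), X), vec(Y))

print(lhs_two == rhs_two, lhs_three == rhs_three)
print(trace(identity(4)))           # the trace of an identity matrix is its order
```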

Recalculate the two outer product operations in Example 3.2 only by using the vector (−1) × [3, 3, 3] instead of [3, 3, 3]. What is the interpretation of the result with regard to the direction of the resulting row and column vectors compared with those in the example?

3.3

Show that $\|v - w\|^2 = \|v\|^2 + \|w\|^2 - 2\|v\|\,\|w\|\cos\theta$ implies $\cos(\theta) = \dfrac{v \cdot w}{\|v\|\,\|w\|}$.

3.4

What happens when you calculate the difference norm ($\|u - v\|^2 = \|u\|^2 - 2(u \cdot v) + \|v\|^2$) for two orthogonal vectors? How is this different from the multiplication norm for two such vectors?

3.5

Explain why the perpendicularity property is a special case of the triangle inequality for vector p-norms.

3.6

For p-norms, explain why the Cauchy-Schwarz inequality is a special case of Hölder's inequality.

3.7

Show that pre-multiplication and post-multiplication with the identity matrix are equivalent.

3.8

Recall that an involutory matrix is one that has the characteristic $X^2 = I$. Can an involutory matrix ever be idempotent?

3.9

For the following matrix, calculate $X^n$ for n = 2, 3, 4, 5. Write a rule

An equitable matrix is a square matrix of order n where all entries are positive and for any three values i, j, k < n, $x_{ij}x_{jk} = x_{ik}$. Show that for equitable matrices of order n, $X^2 = nX$. Give an example of an equitable matrix.

3.13

Communication within work groups can sometimes be studied by looking analytically at individual decision processes. Roby and Lanzetta (1956) looked at this process by constructing three matrices: OR, which maps six observations to six possible responses; PO, which indicates which type of person from three is a source of information for each observation; and PR, which maps which of the three is responsible for each of the six responses. They give these matrices (by example) as
R1 R2 R3 R4 R5 R6 1 0 0 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0

OR =

B B O2 B B B O3 B B B O4 B B B B O5 B @ O6

O1

0

1

C C 0 C C C 0 C C C. 0 C C C C 1 C A 1

PO =

B B P2 B 0 @ 0 P3

P1

0

O1 O2 O3 O4 O5 O6 1 0 1 0 1 0 0 0 1 0 0 0 1 0

1

C C 0 C. A 1

PR =

B B R2 B B B R3 B B B R4 B B B B R5 B @ R6

R1

0

P1 P2 P3 1 1 0 0 0 0 0 0 1 1 0 0 0

1

C C 1 C C C 0 C C C. 0 C C C C 1 C A 1

Exercises


The claim is that multiplying these matrices in the order OR, PO, PR produces a personnel-only matrix (OPR) that reflects “the degree of operator interdependence entailed in a given task and personnel structure” where the total number of entries is proportional to the system complexity, the entries along the main diagonal show how autonomous the relevant agent is, and off-diagonals show sources of information in the organization. Perform matrix multiplication in this order to obtain the OPR matrix using transformations as needed where your final matrix has a zero in the last entry of the first row. Which matrix most affects the diagonal values of OPR when it is manipulated?

3.14

Singer and Spilerman (1973) used matrices to show social mobility between classes. These are stochastic matrices indicating different social class categories where the rows must sum to 1. In this construction a diagonal matrix means that there is no social mobility. Test their claim that the following matrix is the cube root of a stochastic matrix: ⎛ P =⎝
1 3

1 2 (1 1 2 (1

1 − 1/ 3 − 3 1 + 1/ 3 − 3

1 2 (1 1 2 (1

+ 1/ 3 − 1 3
1 − 1/ 3 − 3 .

⎞ ⎠

3.15

Element-by-element matrix multiplication is a Hadamard product (sometimes called a Schur product), and it is denoted with either “∗” or “⊙” (and occasionally “◦”). This element-wise process means that if X and Y are arbitrary matrices of identical size, the Hadamard product is X ⊙ Y, whose ijth element $(X \odot Y)_{ij}$ is $X_{ij}Y_{ij}$. It is trivial to see that X ⊙ Y = Y ⊙ X (an interesting exception to general matrix multiplication properties), but show that for two nonzero matrices tr(X ⊙ Y) = tr(X) · tr(Y). For some nonzero matrix X what does I ⊙ X do? For an order k J matrix, is tr(J ⊙ J) different from tr(JJ)? Show why or why not.

is a “block diagonal” matrix and explain this form. Use generic $x_{ij}$ values or some other general form to denote elements of X. Does this say anything about the Kronecker product using an identity matrix?

3.19

Calculate the LU decomposition of the matrix $\begin{bmatrix} 2 & 3 \\ 4 & 7 \end{bmatrix}$ using your preferred software such as with the lu function of the Matrix library in the R environment. Reassemble the matrix by doing the multiplication without using software.

3.20

The Jordan product for matrices is defined by $X * Y = \frac{1}{2}(XY + YX)$,

This chapter introduces more theoretical and abstract properties of vectors and matrices. We already (by now!) know the mechanics of manipulating these forms, and it is important to carry on to a deeper understanding of the properties asserted by specific row and column formations. The last chapter gave some of the algebraic basics of matrix manipulation, but this is really insufficient for understanding the full scope of linear algebra. Importantly, there are characteristics of a matrix that are not immediately obvious from just looking at its elements and dimension. The structure of a given matrix depends not only on the arrangement of numbers within its rectangular arrangement, but also on the relationship between these elements and the “size” of the matrix. The idea of size is left vague for the moment, but we will shortly see that there are some very specific ways to claim size for matrices, and these have important theoretical properties that define how a matrix works with other structures. This chapter demonstrates some of these properties by providing information about the internal dynamics of matrix structure. Some of these topics are a bit more abstract than those in the last chapter.

4.2 Space and Time

We have already discussed basic Euclidean geometric systems in Chapter 1. Recall that Cartesian coordinate systems define real-measured axes whereby points are uniquely defined in the subsequent space. So in a Cartesian plane defined by $\mathbb{R}^2$, points define an ordered pair designating a unique position on this 2-space. Similarly, an ordered triple defines a unique point in $\mathbb{R}^3$ 3-space. Examples of these are given in Figure 4.1.

Fig. 4.1. Visualizing Space (left panel: x and y in 2-space; right panel: x, y, and z in 3-space)

What this figure shows with the lines is that the ordered pair or ordered triple defines a “path” in the associated space that uniquely arrives at a single point. Observe also that in both cases the path illustrated in the figure begins at the origin of the axes. So we are really defining a vector from the zero point to the arrival point, as shown in Figure 4.2. Wait! This looks like a figure for illustrating the Pythagorean Theorem (the little squares are reminders that these angles are right angles). So if we wanted to get the length of the vectors, it would simply be $\sqrt{x^2 + y^2}$ in the first panel and $\sqrt{x^2 + y^2 + z^2}$ in the second panel. This is the intuition behind the basic

Thinking broadly about the two vectors in Figure 4.2, they take up an amount of “space” in the sense that they define a triangular planar region bounded by the vector itself and its two (left panel) or three (right panel) projections against the axes where the angle on the axis from this projection is necessarily a right angle (hence the reason that these are sometimes called orthogonal projections). Projections define how far along that axis the vector travels in total. Actually a projection does not have to be just along the axes: We can project a vector v against another vector u with the following formula:

$$p = \text{projection of } v \text{ on to } u = \left(\frac{u \cdot v}{\|u\|}\right)\frac{u}{\|u\|}.$$

This is shown in Figure 4.3. We can think of the second fraction on the right-hand side above as the unit vector in the direction of u, so the first fraction is a scalar multiplier giving length. Since the right angle is preserved, we can also think about rotating this arrangement until v is lying on the x-axis. Then it will be the same type of projection as before. Recall from before that two vectors at right angles, such as Cartesian axes, are called orthogonal. It should be reasonably easy to see now that orthogonal vectors produce zero-length projections.
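The projection formula can be sketched as a scalar length times a unit vector, which also makes the orthogonal case easy to check. The function name `project` and the example vectors below are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def project(v, u):
    """Project v on to u: the scalar (u.v / ||u||) times the unit vector u / ||u||."""
    length = dot(u, v) / norm(u)            # how far v travels along u
    unit = [a / norm(u) for a in u]         # unit vector in the direction of u
    return [length * a for a in unit]


u = [2.0, 0.0]                              # lies along the x-axis
print(project([3.0, 4.0], u))               # [3.0, 0.0]
print(project([0.0, 5.0], u))               # orthogonal to u: [0.0, 0.0]
print(norm(project([0.0, 5.0], u)))         # zero-length projection
```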

Fig. 4.3. Vector Projection, Addition, and Subtraction (vectors shown: u, v, v+u, v−u, and the projection p)

Another interesting case is when one vector is simply a multiple of another, say (2, 4) and (4, 8). The lines are then called collinear and the idea of a projection does not make sense. The plot of these vectors would be along the exact same line originating at zero, and we are thus adding no new geometric information. Therefore the vectors still consume the same space. Also shown in Figure 4.3 are the vectors that result from v+u and v−u with angle θ between them. The area of the parallelogram defined by the vector v+u shown in the figure is equal to the absolute value of the length of the orthogonal vector that results from the cross product: u×v. This is related to the projection in the following manner: Call h the length of the line defining the projection in the figure (going from the point p to the point v). Then the parallelogram has size that is height times length: $h\|u\|$ from basic geometry. Because the triangle created by the projection is a right triangle, from the trigonometry rules

Linear Algebra Continued: Matrix Structure

in Chapter 2 (page 55) we get $h = \|v\|\sin\theta$, where θ is the angle between u and v. Substituting we get $u \times v = \|u\|\,\|v\|\sin\theta$ (from an exercise in the last chapter). Therefore the size of the parallelogram is $|u \times v|$ since the order of the cross product could make this negative. Naturally all these principles apply in higher dimension as well.

These ideas get only slightly more complicated when discussing matrices because we can think of them as collections of vectors rather than as purely rectangular structures. The column space of an i × j matrix X consists of every possible linear combination of the j columns in X, and the row space of the same matrix consists of every possible linear combination of the i rows in X. This can be expressed more formally for the i × j matrix X as

• Column Space: all column vectors $x_{.1}, x_{.2}, \ldots, x_{.j}$, and scalars $s_1, s_2, \ldots, s_j$ producing vectors $s_1 x_{.1} + s_2 x_{.2} + \cdots + s_j x_{.j}$

cult problem faced by analysts of survey data is that respondents often answer ordered questions based on their own interpretation of the scale. This means that an answer of “strongly agree” may have different meanings across a survey because individuals anchor against different response points, or they interpret the spacing between categories differently. Aldrich and McKelvey (1977) approached this problem by applying a linear transformation to data on the placement of presidents on a spatial issue dimension (recall the spatial representation in Figure 1.1). The key to their thinking was that while respondent i places candidate j at Xij on an ordinal scale from the survey instrument, such as a 7-point “dove” to “hawk” measure, their real view was Yij along some smoother underlying metric with ﬁner distinctions. Aldrich and McKelvey gave this hypothetical example for three voters:

The graphic for Y above is done to suggest a noncategorical measure such as along $\mathbb{R}$. To obtain a picture of this latent variable, Aldrich and McKelvey suggested a linear transformation for each voter to relate observed categorical scale to this underlying metric: $c_i + \omega_i X_{ij}$. Thus the perceived candidate positions for voter i are given by

$$Y_i = \begin{bmatrix} c_i + \omega_i X_{i1} \\ c_i + \omega_i X_{i2} \\ \vdots \\ c_i + \omega_i X_{iJ} \end{bmatrix},$$

which gives a better vector of estimates for the placement of all J candidates by respondent i because it accounts for individual-level “anchoring” by each respondent, $c_i$. Aldrich and McKelvey then estimated each of the values of c and ω. The value of this linear transformation is that it allows the researchers to see beyond the limitations of the categorical survey data.

Now let $x_{.1}, x_{.2}, \ldots, x_{.j}$ be a set of column vectors in $\mathbb{R}^i$ (i.e., they are all length i). We say that the set of linear combinations of these vectors (in the sense above) is the span of that set. Furthermore, any additional vector in $\mathbb{R}^i$ is spanned by these vectors if and only if it can be expressed as a linear combination of $x_{.1}, x_{.2}, \ldots, x_{.j}$. It should be somewhat intuitive that to span $\mathbb{R}^i$ here j ≥ i must be true. Obviously the minimal condition is j = i for a set of linearly independent vectors, and in this case we then call the set a basis.

This brings us to a more general discussion focused on matrices rather than on vectors. A linear space, X, is the nonempty set of matrices that remains closed under linear transformation:

• If $X_1, X_2, \ldots, X_n$ are in X,
• and $s_1, s_2, \ldots, s_n$ are any scalars,
• then $X_{n+1} = s_1 X_1 + s_2 X_2 + \cdots + s_n X_n$ is in X.

That is, linear combinations of matrices in the linear space have to remain in this linear space. In addition, we can define linear subspaces that represent some enclosed region of the full space. Obviously column and row spaces as discussed above also comprise linear spaces. Except for the pathological case where the linear space consists only of a null matrix, every linear space contains an infinite number of matrices.


Okay, so we still need some more terminology. The span of a finite set of matrices is the set of all matrices that can be achieved by a linear combination of the original matrices. This is confusing because a span is also a linear space. Where it is useful is in determining a minimal set of matrices that span a given linear space. In particular, the finite set of linearly independent matrices in a given linear space that span the linear space is called a basis for this linear space (note the word “a” here since it is not unique). That is, it cannot be made a smaller set because it would lose the ability to produce parts of the linear space, and it cannot be made a larger set because it would then no longer be linearly independent. Let us make this more concrete with an example. A 3 × 3 identity matrix is clearly a basis for $\mathbb{R}^3$ (the three-dimensional space of real numbers) because any three-dimensional coordinate, $[r_1, r_2, r_3]$, can be produced by multiplication of I by three chosen scalars. Yet, the matrices defined by

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}$$

do not qualify as a basis (although the second still spans $\mathbb{R}^3$).
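One way to check these claims numerically is to count the linearly independent columns of each matrix with Gaussian elimination. The sketch below is an illustration, not a method from the text: `rank` is a hypothetical helper that uses exact fractions to avoid rounding, and the two tests (rank equals the number of rows for spanning, rank equals the number of columns for independence) are standard linear algebra facts applied to the column vectors.

```python
from fractions import Fraction


def rank(matrix):
    """Number of linearly independent columns, found by row reduction."""
    rows = [[Fraction(value) for value in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    r = 0                                   # next pivot row
    for c in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if rows[i][c] != 0), None)
        if pivot is None:
            continue                        # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, n_rows):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r


def columns_span(matrix):
    """The columns span R^i when the rank equals the number of rows i."""
    return rank(matrix) == len(matrix)


def columns_independent(matrix):
    """The columns are linearly independent when the rank equals their count."""
    return rank(matrix) == len(matrix[0])


I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
A = [[1, 0, 0], [0, 0, 1], [0, 0, 1]]
B = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]

for name, m in (("I3", I3), ("A", A), ("B", B)):
    print(name, columns_span(m), columns_independent(m))
```

Only the identity matrix passes both checks; the first alternative fails to span, and the second spans but carries a redundant column.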

4.3 The Trace and Determinant of a Matrix

We have already noticed that the diagonals of a square matrix have special importance, particularly in the context of matrix multiplication. As mentioned in Chapter 3, a very simple way to summarize the overall magnitude of the diagonals is the trace. The trace of a square matrix is simply the sum of the diagonal values

$$\mathrm{tr}(X) = \sum_{i=1}^{k} x_{ii}.$$

Another important, but more difficult to calculate, matrix summary is the determinant. The determinant uses all of the values of a square matrix to provide a summary of structure, not just the diagonal like the trace. First let us look at how to calculate the determinant for just 2 × 2 matrices, which is the difference in diagonal products:

$$\det(X) = |X| = \begin{vmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{vmatrix} = x_{11}x_{22} - x_{12}x_{21}.$$


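The notation for a determinant is expressed as det(X) or |X|. Some simple numerical examples are

$$\begin{vmatrix} 1 & 2 \\ 3 & 4 \end{vmatrix} = (1)(4) - (2)(3) = -2$$

$$\begin{vmatrix} 10 & \tfrac{1}{2} \\ 4 & 1 \end{vmatrix} = (10)(1) - \left(\tfrac{1}{2}\right)(4) = 8$$

$$\begin{vmatrix} 2 & 3 \\ 6 & 9 \end{vmatrix} = (2)(9) - (3)(6) = 0.$$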
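The 2 × 2 rule is short enough to verify directly. The sketch below uses a hypothetical `det2` helper and exact fractions for the entry 1/2, reproducing the three numerical examples.

```python
from fractions import Fraction


def det2(m):
    """Determinant of a 2 x 2 matrix: the difference in diagonal products."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


print(det2([[1, 2], [3, 4]]))                  # -2
print(det2([[10, Fraction(1, 2)], [4, 1]]))    # 8
print(det2([[2, 3], [6, 9]]))                  # 0: second row is 3 times the first
```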

The last case, where the determinant is found to be zero, is an important case as we shall see shortly. Unfortunately, calculating determinants gets much more involved with square matrices larger than 2 × 2. First we need to define a submatrix. The submatrix is simply a form achieved by deleting rows and/or columns of a matrix, leaving the remaining elements in their respective places. So for the matrix X, notice the following submatrices whose deleted rows and columns are denoted by subscripting:

To generalize further for n × n matrices we first need to define the following: The ijth minor of X for $x_{ij}$, $|X_{[ij]}|$, is the determinant of the (n − 1) × (n − 1) submatrix that results from taking the ith row and jth column out. Continuing, the cofactor of X for $x_{ij}$ is the minor signed in this way: $(-1)^{i+j}|X_{[ij]}|$. To exhaust the entire matrix we cycle recursively through the columns and take sums with a formula that multiplies the cofactor by the determining value:

$$\det(X) = \sum_{j=1}^{n} (-1)^{i+j} x_{ij} |X_{[ij]}|$$

for some constant i. This is not at all intuitive, and in fact there are some subtleties lurking in there (maybe I should have taken the blue pill). First, recursive means that the algorithm is applied iteratively through progressively smaller submatrices $X_{[ij]}$. Second, this means that we lop off the top row and multiply the values across the resultant submatrices without the associated column. Actually we can pick any row or column to perform this operation, because the results will be equivalent. Rather than continue to pick apart this formula in detail, just look at the application to a 3 × 3 matrix:

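\[
\begin{vmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{vmatrix}
= (+1)x_{11} \begin{vmatrix} x_{22} & x_{23} \\ x_{32} & x_{33} \end{vmatrix}
+ (-1)x_{12} \begin{vmatrix} x_{21} & x_{23} \\ x_{31} & x_{33} \end{vmatrix}
+ (+1)x_{13} \begin{vmatrix} x_{21} & x_{22} \\ x_{31} & x_{32} \end{vmatrix}.
\]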
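The recursive character of the cofactor expansion is perhaps easiest to see in code. What follows is a minimal sketch, assuming list-of-rows matrices and always expanding along the first row; `minor` and `det` are hypothetical helper names. It is meant to mirror the formula, not to be efficient (the work grows factorially with n).

```python
def minor(X, i, j):
    """Submatrix of X with row i and column j removed (0-based indices)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(X) if k != i]


def det(X):
    """Determinant by cofactor expansion along the first row."""
    n = len(X)
    if n == 1:
        return X[0][0]
    # With 0-based j the sign (-1)**j equals (-1)**(i + j) for the first row.
    return sum((-1) ** j * X[0][j] * det(minor(X, 0, j)) for j in range(n))


print(det([[1, 2], [3, 4]]))                    # -2
print(det([[1, 2, 3], [4, 5, 6], [1, 8, 9]]))   # 18
```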

Now the problem is easy because the subsequent three determinant calculations are on 2 × 2 matrices. Here we picked the first row as the starting point as per the standard algorithm. In the bad old days before ubiquitous and powerful computers, people who performed these calculations by hand first looked to start with rows or columns with lots of zeros because each one would mean that the subsequent contribution was automatically zero and did not need to be calculated. Using this more general process means that one has to be more careful about the alternating signs in the sum since picking the row or column to “pivot” on determines the order. For instance, here are the signs, $(-1)^{i+j}$, for a 7 × 7 matrix:

\[
\begin{bmatrix}
+ & - & + & - & + & - & + \\
- & + & - & + & - & + & - \\
+ & - & + & - & + & - & + \\
- & + & - & + & - & + & - \\
+ & - & + & - & + & - & + \\
- & + & - & + & - & + & - \\
+ & - & + & - & + & - & +
\end{bmatrix}
\]

There are shortcuts for calculating the determinants of matrices of this magnitude and greater, but mostly these are relics from slide rule days. Sometimes the shortcuts are revealing about matrix structure. Ishizawa (1991), in looking at the return to scale of public inputs and its effect on the transformation curve of an economy, needed to solve a system of equations by taking the determinant of a 4 × 4 matrix.

The big help here was the two zeros on the top row that meant that we could stop our 4 × 4 calculations after two steps. Fortunately this trick works again because we have the same structure remaining in the 3 × 3 case. Let us be a bit more strategic though and define the 2 × 2 lower right matrix as

\[
\mathbf{D} = \begin{bmatrix} \ell^{1} & k^{1} \\ \ell^{2} & k^{2} \end{bmatrix},
\]

so that we get the neat simplification

\[
\det = \ell^{1}k^{2}|\mathbf{D}| - k^{1}\ell^{2}|\mathbf{D}| = (\ell^{1}k^{2} - k^{1}\ell^{2})|\mathbf{D}| = |\mathbf{D}|^{2}.
\]

Because of the squaring operations here this is guaranteed to be positive, which was substantively important to Ishizawa.

The trace and the determinant have interrelated uses and special properties as well. For instance, Kronecker products on square matrices have the properties tr(X ⊗ Y) = tr(X)tr(Y), and $|\mathbf{X} \otimes \mathbf{Y}| = |\mathbf{X}|^{\ell}|\mathbf{Y}|^{j}$ for the j × j matrix X and the ℓ × ℓ matrix Y (note the switching of exponents). There are some general properties of determinants to keep in mind:

It helps some people to think abstractly about the meaning of a determinant. If the columns of an n × n matrix X are treated as vectors, then the area of the parallelogram spanned by these vectors (in general, the n-dimensional volume) is the absolute value of the determinant of X, where the vectors originate at zero and the opposite point of the parallelogram is the sum of the columns (in the 2 × 2 case the area is the magnitude of the cross product of these vectors, as in Section 4.2). Okay, maybe that is a bit too abstract! Now view the determinant of the 2 × 2 matrix $\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$. The resulting parallelogram looks like the accompanying figure.

[Figure: Spatial Representation of a Determinant]

This figure indicates that the determinant is somehow a description of the size of a matrix in the geometric sense. Suppose that our example matrix were slightly different, say $\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$.

This does not seem like a very drastic change, yet it is quite fundamentally different. It is not too hard to see that the size of the resulting parallelogram would be zero since the two column (or row) vectors would be right on top of each other in the ﬁgure, that is, collinear. We know this also almost immediately from looking at the calculation of the determinant (ad − bc). Here we see that two lines on top of each other produce no area. What does this mean? It means that the column dimension exceeds the offered “information” provided by this matrix form since the columns are simply scalar multiples of each other.
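As a quick worked check of the two cases just described, using only the 2 × 2 formula ad − bc, the areas come out as:

```latex
\[
\left|\det\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\right|
  = |(1)(4) - (2)(3)| = 2,
\qquad
\left|\det\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}\right|
  = |(1)(4) - (2)(2)| = 0 .
\]
```

In the second matrix the second column is exactly twice the first, which is why the parallelogram collapses onto a line.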

4.4 Matrix Rank

The ideas just described are actually more important than they might appear at first. An important characteristic of any matrix is its rank. Rank tells us the “space” in terms of columns or rows that a particular matrix occupies, in other words, how much unique information is held in the rows or columns of a matrix. For example, a matrix that has three columns but only two columns of unique information is given by

\[
\begin{bmatrix} 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}.
\]

This is also true for the matrix

\[
\begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 2 \end{bmatrix},
\]

because the third column is just two times the second column and therefore has no new relational information to offer. More specifically, when any one column of a matrix can be produced as a sum of scalar multiples of other columns, then we say that the matrix is not full rank (sometimes called short rank). In this case at least one column is linearly dependent. This simply means that we can produce the relative relationships defined by this column from the other columns and it thus adds nothing to our understanding of the relationships defined by the matrix. One way to look at this is to say that the matrix in question does not “deserve” its number of columns. Conversely, the collection of vectors determined by the columns is said to be linearly independent if the only set of scalars, $s_1, s_2, \ldots, s_j$, that satisfies $s_1\mathbf{x}_{.1} + s_2\mathbf{x}_{.2} + \cdots + s_j\mathbf{x}_{.j} = \mathbf{0}$ is a set of all zero values, $s_1 = s_2 = \cdots = s_j = 0$. This is just another way of looking at the same idea since such a condition means that we cannot reproduce one column vector from a linear combination of the others. Actually this emphasis on columns is somewhat unwarranted because the rank of a matrix is equal to the rank of its transpose. Therefore, everything just said about columns can also be said about rows. To restate, the row rank of any matrix is also its column rank. This is a very important result and is proven in virtually every text on linear algebra. What makes this somewhat confusing is additional terminology. An (i × j) matrix is full column rank if its rank equals the number of columns, and it is full row rank if its rank equals


its number of rows. Thus, if i > j, then the matrix can be full column rank but never full row rank. This does not necessarily mean that it has to be full column rank just because there are fewer columns than rows. It should be clear from the example that a (square) matrix is full rank if and only if it has a nonzero determinant. This is the same thing as saying that a matrix is full rank if it is nonsingular or invertible (see Section 4.6 below). This is a handy way to calculate whether a matrix is full rank because the linear dependency within can be subtle (unlike our example above). In the next section we will explore matrix features of this type.
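Since subtle linear dependencies are hard to spot by eye, rank is usually computed rather than inspected. The sketch below is one possible approach, assuming exact rational arithmetic via the standard library's `fractions` module and row reduction to count pivots; the function name `rank` is an illustrative choice.

```python
from fractions import Fraction


def rank(X):
    """Rank of a matrix (list of rows) by Gaussian elimination, exact arithmetic."""
    A = [[Fraction(v) for v in row] for row in X]
    n_rows, n_cols = len(A), len(A[0])
    r = 0  # number of pivots found so far
    for c in range(n_cols):
        # Look for a row at or below r with a nonzero entry in column c.
        pivot = next((i for i in range(r, n_rows) if A[i][c] != 0), None)
        if pivot is None:
            continue  # no new information in this column
        A[r], A[pivot] = A[pivot], A[r]
        for i in range(r + 1, n_rows):
            factor = A[i][c] / A[r][c]
            A[i] = [a - factor * b for a, b in zip(A[i], A[r])]
        r += 1
        if r == n_rows:
            break
    return r


print(rank([[0, 1, 1], [1, 1, 1]]))  # 2: three columns, two of unique information
print(rank([[1, 2], [2, 4]]))        # 1: singular, determinant zero
print(rank([[1, 2], [3, 4]]))        # 2: full rank, determinant nonzero
```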

Example 4.3: Structural Equation Models. In their text Hanushek and Jackson (1977, Chapter 9) provided a technical overview of structural equation models where systems of equations are assumed to simultaneously affect each other to reflect endogenous social phenomena. Often these models are described in matrix terms, such as their example (p. 265)

⎡

⎢ ⎢ ⎢ 0 −1 γ56 ⎢ ⎢ A = ⎢ 0 γ65 −1 ⎢ ⎢ ⎢ β34 0 β36 ⎣ β44 0 β46

γ24

1

γ26

0

⎥ ⎥ 0 ⎥ ⎥ ⎥ 0 0 ⎥. ⎥ ⎥ 0 β32 ⎥ ⎦ 0 β42 0

−1

⎤

Without doing any calculations we can see that this matrix is of rank less than 5 because there is a column of all zeros. We can also produce this result by calculating the determinant, but that is too much trouble. A zero determinant is not changed by multiplying the matrix by an identity in advance, multiplying by a permutation matrix in advance (which can at most flip the sign of the determinant), or by taking transformations.

that is immediately identiﬁable as having a zero determinant by the general determinant form given on page 142 because each ith minor (the matrix that remains when the ith row and column are removed) is multiplied by the ith value on the ﬁrst row.

Some rank properties are more specialized. An idempotent matrix has the property that rank(X) = tr(X), and more generally, for any square matrix A with the property that A² = sA for some scalar s, s·rank(A) = tr(A).

To emphasize that matrix rank is a fundamental principle, we now give some standard properties related to other matrix characteristics.

We have seen matrix “size” as described by the trace, determinant, and rank. Additionally, we can describe matrices by norming, but matrix norms are a little bit more involved than the vector norms we saw before. There are two general types, the trace norm (sometimes called the Euclidean norm or the Frobenius norm):

\[
\|\mathbf{X}\|_F = \left[ \sum_{i}\sum_{j} |x_{ij}|^{2} \right]^{\frac{1}{2}}
\]

(the square root of the sum of each element squared), and the p-norm:

\[
\|\mathbf{X}\|_p = \max_{\|\mathbf{v}\|_p = 1} \|\mathbf{X}\mathbf{v}\|_p,
\]

which is defined with regard to the unit vector v whose length is equal to the number of columns in X. For p = 1 and an I × J matrix, this reduces to summing absolute values down columns and taking the maximum:

\[
\|\mathbf{X}\|_1 = \max_{1 \le j \le J} \sum_{i=1}^{I} |x_{ij}|.
\]

Conversely, the infinity version of the matrix p-norm sums across rows before taking the maximum:

\[
\|\mathbf{X}\|_\infty = \max_{1 \le i \le I} \sum_{j=1}^{J} |x_{ij}|.
\]
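The three norms above translate almost directly into code. This is a minimal sketch in plain Python; the function names and the small matrix `X_demo` are hypothetical and chosen only for illustration (they are not the X and Y matrices used in the text's examples).

```python
import math


def frobenius_norm(X):
    """Square root of the sum of each element squared."""
    return math.sqrt(sum(abs(x) ** 2 for row in X for x in row))


def one_norm(X):
    """Sum absolute values down each column, then take the maximum."""
    return max(sum(abs(row[j]) for row in X) for j in range(len(X[0])))


def inf_norm(X):
    """Sum absolute values across each row, then take the maximum."""
    return max(sum(abs(x) for x in row) for row in X)


X_demo = [[1, -2], [3, 4]]       # hypothetical example matrix
print(frobenius_norm(X_demo))    # sqrt(30)
print(one_norm(X_demo))          # 6, from the second column: |-2| + |4|
print(inf_norm(X_demo))          # 7, from the second row: |3| + |4|
```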

Like the inﬁnity form of the vector norm, this is somewhat unintuitive because there is no apparent use of a limit. There are some interesting properties of matrix norms:

showing the second property above.

Example 4.5: Schwarz Inequality for Matrices. Using the same X and Y matrices and the p = 1 norm, observe that

\[
|\mathbf{X} \cdot \mathbf{Y}| \le \|\mathbf{X}\|_1 \|\mathbf{Y}\|_1
\]

\[
(12) + (0) \le \max(8, 3) \cdot \max(4, 6),
\]

showing that the inequality holds: 12 < 48. This is a neat property because it shows a relationship between the trace and matrix norm.

4.6 Matrix Inversion

Just like scalars have inverses, some square matrices have a matrix inverse. The inverse of a matrix X is denoted $\mathbf{X}^{-1}$ and defined by the property $\mathbf{X}\mathbf{X}^{-1} = \mathbf{X}^{-1}\mathbf{X} = \mathbf{I}$. That is, when a matrix is pre-multiplied or post-multiplied by its inverse the result is an identity matrix of the same size. For example, consider the following matrix and its inverse:

\[
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
\begin{bmatrix} -2.0 & 1.0 \\ 1.5 & -0.5 \end{bmatrix}
=
\begin{bmatrix} -2.0 & 1.0 \\ 1.5 & -0.5 \end{bmatrix}
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\]


Not all square matrices are invertible. A singular matrix cannot be inverted, and often “singular” and “noninvertible” are used as synonyms. Usually matrix inverses are calculated by computer software because it is quite time-consuming with reasonably large matrices. However, there is a very nice trick for immediately inverting 2 × 2 matrices, which is given by

\[
\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{bmatrix},
\qquad
\mathbf{X}^{-1} = \det(\mathbf{X})^{-1} \begin{bmatrix} x_{22} & -x_{12} \\ -x_{21} & x_{11} \end{bmatrix}.
\]
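The 2 × 2 trick can be written as a few lines of code. The sketch below assumes exact arithmetic through the standard library's `fractions.Fraction` so that results such as 1.5 and −0.5 are not affected by rounding; `inverse_2x2` is a hypothetical name.

```python
from fractions import Fraction


def inverse_2x2(X):
    """Invert a 2 x 2 matrix: swap the diagonal, negate the off-diagonal,
    and divide everything by the determinant."""
    (a, b), (c, d) = X
    det = Fraction(a) * d - Fraction(b) * c
    if det == 0:
        raise ValueError("matrix is singular and cannot be inverted")
    return [[d / det, -b / det], [-c / det, a / det]]


inv = inverse_2x2([[1, 2], [3, 4]])
print([[float(v) for v in row] for row in inv])  # [[-2.0, 1.0], [1.5, -0.5]]
```

Calling `inverse_2x2([[2, 3], [6, 9]])` raises the error, since that matrix has a zero determinant.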

A matrix inverse is unique: There is only one matrix that meets the multiplicative condition above for a nonsingular square matrix. For inverting larger matrices there is a process based on Gauss-Jordan elimination that makes use of elementary row operations to invert the matrix. Although matrix inversion would normally be done courtesy of software for nearly all problems in the social sciences, the process of Gauss-Jordan elimination is a revealing insight into inversion because it highlights the “inverse” aspect with the role of the identity matrix as the linear algebra equivalent of 1. Start with the matrix of interest partitioned next to the identity matrix and allow the following operations:

• Any row may be multiplied or divided by a nonzero scalar.
• Any two rows may be switched.
• Any row may be multiplied or divided by a scalar and then added to another row. Note: This operation does not change the original row; its multiple is used but not saved.


Of course the goal of these operations has not yet been given. We want to iteratively apply these steps until the identity matrix on the right-hand side is on the left-hand side. So the operations are done with the intent of zeroing out the off-diagonals on the left matrix of the partition and then dividing to obtain 1’s on the diagonal. During this process we do not care about what results on the right-hand side until the end, when this is known to be the inverse of the original matrix. Let’s perform this process on a 3 × 3 matrix:

\[
\left[\begin{array}{rrr|rrr}
1 & 2 & 3 & 1 & 0 & 0 \\
4 & 5 & 6 & 0 & 1 & 0 \\
1 & 8 & 9 & 0 & 0 & 1
\end{array}\right].
\]

Now multiply the first row by −4, adding it to the second row, and multiply the first row by −1, adding it to the third row:

\[
\left[\begin{array}{rrr|rrr}
1 & 2 & 3 & 1 & 0 & 0 \\
0 & -3 & -6 & -4 & 1 & 0 \\
0 & 6 & 6 & -1 & 0 & 1
\end{array}\right].
\]

Multiply the second row by 1/2, adding it to the first row, and simply add this same row to the third row:

\[
\left[\begin{array}{rrr|rrr}
1 & \frac{1}{2} & 0 & -1 & \frac{1}{2} & 0 \\
0 & -3 & -6 & -4 & 1 & 0 \\
0 & 3 & 0 & -5 & 1 & 1
\end{array}\right].
\]

Multiply the third row by −1/6, adding it to the first row, and add the third row (un)multiplied to the second row:

\[
\left[\begin{array}{rrr|rrr}
1 & 0 & 0 & -\frac{1}{6} & \frac{1}{3} & -\frac{1}{6} \\
0 & 0 & -6 & -9 & 2 & 1 \\
0 & 3 & 0 & -5 & 1 & 1
\end{array}\right].
\]

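Finally, just divide the second row by −6 and the third row by 3, and then switch their places:

\[
\left[\begin{array}{rrr|rrr}
1 & 0 & 0 & -\frac{1}{6} & \frac{1}{3} & -\frac{1}{6} \\
0 & 1 & 0 & -\frac{5}{3} & \frac{1}{3} & \frac{1}{3} \\
0 & 0 & 1 & \frac{3}{2} & -\frac{1}{3} & -\frac{1}{6}
\end{array}\right],
\]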

thus completing the operation. This process also highlights the fact that matrices are representations of linear equations. The operations we performed are linear transformations, just like those discussed at the beginning of this chapter. We already know that singular matrices cannot be inverted, but consider the described inversion process applied to an obvious case:

\[
\left[\begin{array}{rr|rr}
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1
\end{array}\right].
\]

It is easy to see that there is nothing that can be done to put a nonzero value in the second column of the matrix to the left of the partition. In this way the Gauss-Jordan process helps to illustrate a theoretical concept. Most of the properties of matrix inversion are predictable (the last property listed relies on the fact that the product of invertible matrices is always itself invertible):
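The Gauss-Jordan process described in this section can be sketched in code as follows. This is an illustration under stated assumptions rather than a production routine: it uses exact fractions from the standard library, works column by column (so the order of row operations differs from the hand calculation above, although the unique inverse is the same), and the name `gauss_jordan_inverse` is hypothetical.

```python
from fractions import Fraction


def gauss_jordan_inverse(X):
    """Invert a square matrix by row-reducing the partition [X | I]."""
    n = len(X)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(X)]
    for col in range(n):
        # Switch rows if needed so that the pivot entry is nonzero.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular and cannot be inverted")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Divide the pivot row by a scalar to obtain a 1 on the diagonal.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Add multiples of the pivot row to zero out the rest of the column.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    # The right-hand side of the partition now holds the inverse.
    return [row[n:] for row in aug]


for row in gauss_jordan_inverse([[1, 2, 3], [4, 5, 6], [1, 8, 9]]):
    print([str(v) for v in row])
```

Applied to the singular case `[[1, 0], [1, 0]]`, the function finds no nonzero pivot for the second column and raises the error, which matches the observation that nothing can be done to put a nonzero value there.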

The “ordinary least squares” method for obtaining regression parameters proceeds as follows. Suppose that y is the outcome variable of interest and X is a matrix of explanatory variables with a leading column of 1’s. What we would like is the vector $\hat{\mathbf{b}}$ that contains the intercept and the regression slope, which is calculated by the equation $\hat{\mathbf{b}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, which might have seemed hard before this point in the chapter. What we need to do then is just a series of multiplications, one inverse, and two transposes. To make the example more informative, we can look at some actual data with two variables of interest (even though we could just do this in scalar algebra since it is just a bivariate problem). Governments often worry about the economic condition of senior citizens for political and social reasons. Typically in a large industrialized society, a substantial portion of these people obtain the bulk of their income from government pensions. One important question is whether there is enough support through these payments to provide subsistence above the poverty line. To see if this is a concern, the European Union (EU) looked at this question in 1998 for the (then) 15 member countries with two variables: (1) the median (EU standardized) income of individuals age 65 and older as a percentage of the population age 0–64, and (2) the percentage with income below 60% of the median (EU standardized) income of the national population. The data from the European Household Community Panel Survey are
Nation            Relative Income    Poverty Rate
Netherlands            93.00              7.00
Luxembourg             99.00              8.00
Sweden                 83.00              8.00
Germany                97.00             11.00
Italy                  96.00             14.00
Spain                  91.00             16.00
Finland                78.00             17.00
France                 90.00             19.00
United.Kingdom         78.00             21.00
Belgium                76.00             22.00
Austria                84.00             24.00
Denmark                68.00             31.00
Portugal               76.00             33.00
Greece                 74.00             33.00
Ireland                69.00             34.00

So the y vector is the second column of the table and the X matrix is the first column along with the leading column of 1’s added to account for the intercept (also called the constant, which explains the 1’s). The first quantity that we want to calculate is

\[
\mathbf{X}'\mathbf{X} = \begin{bmatrix} 15 & 1252 \\ 1252 & 105982 \end{bmatrix}.
\]

These results are shown in Figure 4.4 for the 15 EU countries of the time, with a line for the estimated underlying trend that has a slope of m = −0.77 (rounded) and an intercept at b = 84 (also rounded). What does this mean? It means that for a one-unit positive change (say from 92 to 93) in over-65 relative income, there will be an expected change in over-65 poverty rate of −0.77 (i.e., a reduction). Once one understands linear regression in matrix notation, it is much easier to see what is happening. For instance, if there were a second explanatory variable (there are many more than one in social science models), then it would simply be an additional column of the X matrix and all the calculations would proceed exactly as we have done here.
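For readers who want to retrace the arithmetic, the sketch below recomputes the least squares quantities from the table of EU data using only the 2 × 2 inversion trick from Section 4.6. The variable names are illustrative assumptions. Run on the tabled values, the slope comes out at roughly −0.765 and the intercept at roughly 83.7, in line with the rounded figures reported in the text.

```python
relative_income = [93, 99, 83, 97, 96, 91, 78, 90, 78, 76, 84, 68, 76, 74, 69]
poverty_rate = [7, 8, 8, 11, 14, 16, 17, 19, 21, 22, 24, 31, 33, 33, 34]

n = len(relative_income)
sx = sum(relative_income)
sxx = sum(x * x for x in relative_income)
sy = sum(poverty_rate)
sxy = sum(x * y for x, y in zip(relative_income, poverty_rate))

# With a leading column of 1's, X'X and X'y reduce to these sums.
XtX = [[n, sx], [sx, sxx]]
Xty = [sy, sxy]

# Invert the 2 x 2 matrix X'X with the determinant trick.
det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
XtX_inv = [[XtX[1][1] / det, -XtX[0][1] / det],
           [-XtX[1][0] / det, XtX[0][0] / det]]

# b-hat = (X'X)^{-1} X'y gives the intercept and then the slope.
intercept = XtX_inv[0][0] * Xty[0] + XtX_inv[0][1] * Xty[1]
slope = XtX_inv[1][0] * Xty[0] + XtX_inv[1][1] * Xty[1]
print(XtX, Xty)
print(round(intercept, 2), round(slope, 3))
```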

4.7 Linear Systems of Equations

A basic and common problem in applied mathematics is the search for a solution, x, to the system of simultaneous linear equations defined by

\[
\mathbf{A}\mathbf{x} = \mathbf{y},
\]

where $\mathbf{A} \in \mathbb{R}^{p \times q}$, $\mathbf{x} \in \mathbb{R}^{q}$, and $\mathbf{y} \in \mathbb{R}^{p}$. If the matrix A is invertible, then there exists a unique, easy-to-find, solution vector $\mathbf{x} = \mathbf{A}^{-1}\mathbf{y}$ satisfying Ax = y. Note that this shows the usefulness of a matrix inverse. However, if the system of linear equations in Ax = y is not consistent, then there exists no solution. Consistency simply means that if a linear relationship exists in the rows of A, it must also exist in the corresponding rows of y. For example, the following simple system of linear equations is consistent:

\[
\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} \mathbf{x} = \begin{bmatrix} 3 \\ 6 \end{bmatrix}
\]

because the second row is two times the ﬁrst across (x|y). This implies that y is contained in the linear span of the columns (range) of A, denoted as y ∈ R(A). Recall that a set of linearly independent vectors (i.e., the columns here) that span a vector subspace is called a basis of that subspace. Conversely, the following

because there is no solution for x that satisfies both rows. In the notation above this is denoted $\mathbf{y} \notin R(\mathbf{A})$, and it provides no further use without modification of the original problem. It is worth noting, for purposes of the discussion below, that if $\mathbf{A}^{-1}$ exists, then Ax = y is always consistent because there exist no linear relationships in the rows of A that must be satisfied in y. The inconsistent case is the more common statistically in that a solution that minimizes the squared sum of the inconsistencies is typically applied (ordinary least squares). In addition to the possibilities of the general system of equations Ax = y having a unique solution and no solution, this arbitrary system of equations can also have an infinite number of solutions. In fact, the consistent system above, with $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$, is such a case. For example, we could solve to obtain x = (1, 1)′, x = (−1, 2)′, x = (5, −1)′, and so on. This occurs when the A matrix is singular: rank(A) = dimension(R(A)) < q. When the A matrix is singular at least one column vector is a linear combination of the others, and the matrix therefore contains redundant information. In other words, there are fewer than q independent column vectors in A.
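The infinite-solutions case can be checked directly for the consistent system given earlier in this section, whose rows are (1, 2 | 3) and (2, 4 | 6). The sketch below is illustrative only; `matvec` is a hypothetical helper, and the parametrization x = (3 − 2t, t) is simply read off the first equation.

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


A = [[1, 2], [2, 4]]   # singular: the second row is twice the first
y = [3, 6]

# The three solutions listed in the text all satisfy Ax = y.
for x in [(1, 1), (-1, 2), (5, -1)]:
    print(x, matvec(A, x) == y)

# Every point on the line x = (3 - 2t, t) is a solution as well.
print(all(matvec(A, (3 - 2 * t, t)) == y for t in range(-5, 6)))
```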

4.8 Eigen-Analysis of Matrices

We start this section with a brief motivation. Apparently a single original population undergoes genetic differentiation once it is dispersed into new geographic regions. Furthermore, it is interesting anthropologically to compare the rate of this genetic change with changes in nongenetic traits such as language, culture, and use of technology. Sorenson and Kenmore (1974) explored the genetic drift of proto-agricultural people in the Eastern Highlands of New Guinea with the idea that changes in horticulture and mountainous geography both determined patterns of dispersion. This is an interesting study because it uses biological evidence (nine alternative forms of a gene) to make claims about the relatedness of groups that are geographically distinct but similar ethnohistorically and linguistically. The raw genetic information can be summarized in a large matrix, but the information in this form is not really the primary interest. To see differences and similarities Sorenson and Kenmore transformed these variables into just two individual factors (new composite variables) that appear to explain the bulk of the genetic variation. Once that is done it is easy to graph the groups in a single plot and then look at similarities geometrically. This useful result is shown in the accompanying figure, where we see the placement of these linguistic groups according to the similarity in blood-group genetics.

[Figure: Linguistic Groups Genetically. Plotted groups: Agarabi, Pawaian, Gimi, Gimi-Labogai, Kamano, North Fore, South Fore, Genatei, Awa, Kanite, Keiagana, Tairora, Usurufa, Gadsup.]

The tool they used for turning the large multidimensional matrix of unwieldy data into an intuitive two-dimensional structure was eigenanalysis. A useful and theoretically important feature of a given square matrix is the set of eigenvalues associated with this matrix. Every p × p matrix X has p scalar values, $\lambda_i$, i = 1, . . . , p, such that $\mathbf{X}\mathbf{h}_i = \lambda_i\mathbf{h}_i$ for some corresponding vector $\mathbf{h}_i$. In this decomposition, $\lambda_i$ is called an eigenvalue of X and $\mathbf{h}_i$ is called an eigenvector of X. These eigenvalues show important structural features of the matrix. Confusingly, these are also called the characteristic roots and characteristic vectors of X, and the process is also called spectral decomposition. The expression above can also be rewritten to produce the characteristic

from which we take the eigenvalues from the diagonal. Note the descending order. To see the mechanics of this process more clearly, consider finding the eigenvalues of

\[
\mathbf{Y} = \begin{bmatrix} 3 & -1 \\ 2 & 0 \end{bmatrix}.
\]

To do this we expand and solve the determinant of the characteristic equation:

\[
|\mathbf{Y} - \lambda\mathbf{I}| = (3 - \lambda)(0 - \lambda) - (-2) = \lambda^{2} - 3\lambda + 2
\]

and the only solutions to this quadratic expression are $\lambda_1 = 1$, $\lambda_2 = 2$. In fact, for a p × p matrix, the resulting characteristic equation will be a polynomial of order p. This is why we had a quadratic expression here. Unfortunately, the eigenvalues that result from the characteristic equation can be zero, repeated (nonunique) values, or even complex numbers. However,


all symmetric matrices like the 3 × 3 example above are guaranteed to have real-valued eigenvalues. Eigenvalues and eigenvectors are associated. That is, for each eigenvector of a given matrix X there is exactly one corresponding eigenvalue such that

\[
\lambda = \frac{\mathbf{h}'\mathbf{X}\mathbf{h}}{\mathbf{h}'\mathbf{h}}.
\]

This uniqueness, however, is asymmetric. For each eigenvalue of the matrix there is an infinite number of eigenvectors, all determined by scalar multiplication: If h is an eigenvector corresponding to the eigenvalue λ, then sh is also an eigenvector corresponding to this same eigenvalue where s is any nonzero scalar. There are many interesting matrix properties related to eigenvalues. For instance, the number of nonzero eigenvalues is the rank of X (for diagonalizable matrices such as symmetric ones), the sum of the eigenvalues is the trace of X, and the product of the eigenvalues is the determinant of X. From these principles it follows immediately that a matrix is singular if and only if it has a zero eigenvalue.
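For a 2 × 2 matrix the characteristic polynomial is λ² − tr(X)λ + det(X), so the eigenvalues follow from the quadratic formula. The sketch below uses that fact to confirm the eigenvalues of the Y matrix from this section and the trace and determinant properties just stated; `eigenvalues_2x2` is a hypothetical name, and the function deliberately refuses the complex case.

```python
import math


def eigenvalues_2x2(X):
    """Real eigenvalues of a 2 x 2 matrix, largest first."""
    (a, b), (c, d) = X
    tr, det = a + d, a * d - b * c
    disc = tr * tr - 4 * det
    if disc < 0:
        raise ValueError("eigenvalues are complex")
    root = math.sqrt(disc)
    return (tr + root) / 2, (tr - root) / 2


lam1, lam2 = eigenvalues_2x2([[3, -1], [2, 0]])
print(lam1, lam2)    # 2.0 1.0
print(lam1 + lam2)   # 3.0, the trace of Y
print(lam1 * lam2)   # 2.0, the determinant of Y
```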

Properties of Eigenvalues for a Nonsingular (n × n) Matrix

Inverse Property: If $\lambda_i$ is an eigenvalue of X, then $1/\lambda_i$ is an eigenvalue of $\mathbf{X}^{-1}$
Transpose Property: X and X′ have the same eigenvalues
Identity Matrix: For I, $\sum \lambda_i = n$
Exponentiation: If $\lambda_i$ is an eigenvalue of X, then $\lambda_i^{k}$ is an eigenvalue of $\mathbf{X}^{k}$, for k a positive integer

It is also true that if there are no zero-value eigenvalues, then the eigenvectors determine a basis for the space determined by the size of the matrix ($\mathbb{R}^2$, $\mathbb{R}^3$, etc.). Even more interestingly, symmetric nonsingular matrices have eigenvectors that are perpendicular (see the Exercises). A notion related to eigenvalues is matrix conditioning. For a symmetric definite matrix, the ratio of the largest eigenvalue to the smallest eigenvalue is the condition number. If this number is large, then we say that the matrix is “ill-conditioned,” and it usually has poor properties. For example, if the matrix is nearly singular (but not quite), then the smallest eigenvalue will be close to zero and the ratio will be large for any reasonable value of the largest eigenvalue. As an example of this problem, in the use of matrix inversion to solve systems of linear equations, an ill-conditioned A matrix means that small changes in A will produce large changes in $\mathbf{A}^{-1}$ and therefore the calculation of x will differ dramatically.
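To give a feel for how quickly the condition number grows near singularity, here is a small sketch restricted to symmetric positive definite 2 × 2 matrices, where the eigenvalues come from the quadratic formula. The function name and the two example matrices are hypothetical illustrations.

```python
import math


def condition_number_2x2(X):
    """Ratio of largest to smallest eigenvalue for a symmetric
    positive definite 2 x 2 matrix."""
    (a, b), (c, d) = X
    tr, det = a + d, a * d - b * c
    root = math.sqrt(tr * tr - 4 * det)
    largest, smallest = (tr + root) / 2, (tr - root) / 2
    return largest / smallest


well = [[2, 1], [1, 2]]           # eigenvalues 3 and 1
nearly = [[1, 1], [1, 1.0001]]    # columns almost identical
print(condition_number_2x2(well))     # 3.0
print(condition_number_2x2(nearly))   # on the order of 40,000
```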

Example 4.9: Analyzing Social Mobility with Eigens. Duncan (1966) analyzed social mobility between categories of employment (from the 1962 Current Population Survey) to produce probabilities for blacks and whites [also analyzed in McFarland (1981) from which this discussion is derived]. This well-known finding is summarized in two transition matrices, indicating probabilities for changing between higher white collar, lower white collar, higher manual, lower manual, and farm:

where the rows and columns are in the order of employment categories given. So, for instance, 0.576 in the first row and first column of the W matrix means that we expect 57.6% of the children of white higher white collar workers will themselves become higher white collar workers. Contrastingly, 0.573 in the first row and fourth column of the B matrix means that we expect 57.3% of the children of black higher white collar workers to become lower manual workers. A lot can be learned by staring at these matrices for some time, but what tools will let us understand long-run trends built into the data? Since these are transition probabilities, we could multiply one of these matrices by itself a large number of times as a simulation of future events (this is actually the topic of Chapter 9). It might be more convenient for answering simple questions to use eigenanalysis to pull structure out of the matrix instead. It turns out that the eigenvector produced from $\mathbf{X}\mathbf{h}_i = \lambda_i\mathbf{h}_i$ is the right eigenvector because it sits on the right-hand side of X here. This is the default, so when an eigenvector is referenced without any qualifiers, the form derived above is the appropriate one. However, there is also the less commonly used left eigenvector produced from $\mathbf{h}_i'\mathbf{X} = \lambda_i\mathbf{h}_i'$ and so-named for the obvious reason. If X is a symmetric matrix, then the two vectors are identical (the eigenvalues are the same in either case). If X is not symmetrical, they differ, but the left eigenvector can be produced from using the transpose: $\mathbf{X}'\mathbf{h}_i = \lambda_i\mathbf{h}_i$. The spectral component corresponding to the ith eigenvalue is the square matrix produced from the cross product of the right and left eigenvectors over the dot product of the right and left

where only a single row of this 5 × 5 matrix is given here because all rows are identical (a result of λ1 = 1). The spectral values corresponding to the ﬁrst eigenvalue give the long-run (stable) proportions implied by the matrix probabilities. That is, if conditions do not change, these will be the eventual population proportions. So if the mobility trends persevere, eventually a little over two-thirds of the black population will be in lower manual occupations, and less than 10% will be in each of the white collar occupational categories (keep in mind that Duncan collected the data before the zenith of the civil rights movement). In contrast, for whites, about 40% will be in the higher white collar category with 15–20% in each of the other nonfarm occupational groups. Subsequent spectral components from declining eigenvalues give weighted propensities for movement between individual matrix categories. The second eigenvalue produces the most important indicator, followed by the third, and so on. The second spectral components corresponding to the second eigenvalues λ2,black = 0.177676, λ2,white = 0.348045 are

Notice that the full matrix is given here because the rows now differ. McFarland noticed the structure highlighted here with the boxes containing positive values. For blacks there is a tendency for white collar status and higher manual to be self-reinforcing: Once landed in the upper left 2 × 3 submatrix, there is a tendency to remain and negative inﬂuences on leaving. The same phenomenon applies for blacks to manual/farm labor: Once there it is more difﬁcult to leave. For whites the phenomenon is the same, except this barrier effect puts higher manual in the less desirable block. This suggests a racial differentiation with regard to higher manual occupations.
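The long-run proportions discussed in this example can also be approximated by the brute-force route mentioned earlier: applying the transition matrix over and over. The sketch below shows the idea on a made-up two-category transition matrix, `P_toy`, which is purely hypothetical and is not Duncan's data; the function name is likewise an assumption.

```python
def long_run_proportions(P, steps=200):
    """Repeatedly apply a row-stochastic transition matrix to a starting
    distribution; for a well-behaved chain this settles at the stable
    proportions tied to the eigenvalue of 1."""
    n = len(P)
    pi = [1.0 / n] * n   # start from equal proportions
    for _ in range(steps):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical two-category example (rows sum to one).
P_toy = [[0.9, 0.1],
         [0.5, 0.5]]
pi = long_run_proportions(P_toy)
print([round(p, 4) for p in pi])   # [0.8333, 0.1667]
```

For this toy matrix the stable proportions are 5/6 and 1/6, which can be confirmed by hand from the requirement that the proportions are unchanged by one more transition.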

4.9 Quadratic Forms and Descriptions

This section describes a general attribute known as definiteness, although this term means nothing on its own. The central question is what properties does an n × n matrix X possess when pre- and post-multiplied by a conformable nonzero vector $\mathbf{y} \in \mathbb{R}^{n}$. The quadratic form of the matrix X is given by $\mathbf{y}'\mathbf{X}\mathbf{y} = s$, where the result is some scalar, s. If s = 0 for every possible vector y, then X can only be the null matrix. But we are really interested in more nuanced

We can also say that X is indefinite if it is neither nonnegative definite nor nonpositive definite. The big result is worth stating with emphasis: A positive definite matrix is always nonsingular. Furthermore, a positive definite matrix is therefore invertible and the resulting inverse will also be positive definite. Positive semidefinite matrices are sometimes singular and sometimes not. If such a matrix is nonsingular, then its inverse is also nonsingular. One theme that we keep returning to is the importance of the diagonal of a matrix. It turns out that every diagonal element of a positive definite matrix is positive, and every diagonal element of a negative definite matrix is negative. In addition, every diagonal element of a positive semidefinite matrix is nonnegative, and every diagonal element of a negative semidefinite matrix is nonpositive. This makes sense because we can switch properties between “negativeness” and “positiveness” by simply multiplying the matrix by −1.

Example 4.10: LDU Decomposition. In the last chapter we learned about LU decomposition as a way to triangularize matrices. The vague caveat at the time was that this could be done to “many” matrices. The condition, unstated at the time, is that the matrix must be nonsingular. We now know what that means, so it is now clear when LU decomposition is possible. More generally, though, any p × q matrix can be decomposed as follows:

\[
\underset{(p \times q)}{\mathbf{A}} = \underset{(p \times p)}{\mathbf{L}}\;\underset{(p \times q)}{\mathbf{D}}\;\underset{(q \times q)}{\mathbf{U}},
\qquad\text{where}\qquad
\mathbf{D} = \begin{bmatrix} \mathbf{D}_{r \times r} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix},
\]

where L (lower triangular) and U (upper triangular) are nonsingular (even given a singular matrix A). The diagonal matrix $\mathbf{D}_{r \times r}$ is unique and has dimension and rank r that corresponds to the rank of A. If A is positive definite, and symmetric, then $\mathbf{D}_{r \times r} = \mathbf{D}$ (i.e., r = q) and $\mathbf{A} = \mathbf{L}\mathbf{D}\mathbf{L}'$ with unique L. For example, consider the LDU decomposition of the 3 × 3 unsymmetric, positive definite matrix A:

Prove that tr(XY) ≠ tr(X)tr(Y), except for special cases.

In their formal study of models of group interaction, Bonacich and Bailey (1971) looked at linear and nonlinear systems of equations (their interest was in models that include factors such as free time, psychological compatibility, friendliness, and common interests). One of their conditions for a stable system was that the determinant of the matrix

\[
\begin{pmatrix} -r & a & 0 \\ 0 & -r & a \\ 1 & 0 & -r \end{pmatrix}
\]

must have a positive determinant for values of r and a. What is the 1 Find the eigenvalues of A = ⎣ 2 arithmetic relationship that must exist for this to⎡ true. ⎤ be ⎡ ⎤ 3 1 ⎦ and A = ⎣ 4 2 ⎦. −1 4

Land (1980) develops a mathematical theory of social change based on a model of underlying demographic accounts. The corresponding population mathematical models are shown to help identify and track changing social indicators, although no data are used in the article. Label $L_x$ as the number of people in a population that are between x and x + 1 years old. Then the square matrix P of order (ω + 1) × (ω + 1) is given by

\[
\mathbf{P} = \begin{bmatrix}
0 & 0 & 0 & 0 & \cdots & 0 \\
L_1/L_0 & 0 & 0 & 0 & \cdots & 0 \\
0 & L_2/L_1 & 0 & 0 & \cdots & 0 \\
0 & 0 & L_3/L_2 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & L_\omega/L_{\omega-1} & 0
\end{bmatrix},
\]

where ω is the assumed maximum lifespan and each of the nonzero ratios gives the proportion of people living to the next age. The matrix (I − P) is theoretically important. Calculate its trace and inverse. The inverse will be a lower triangular form with survivorship probabilities as the nonzero values, and the column sums are standard life expectations in the actuarial sense.

4.14 The Clement matrix is a symmetric, tridiagonal matrix with zero diagonal values. It is sometimes used to test algorithms for computing inverses and eigenvalues. Compute the eigenvalues of the following

\[
\begin{aligned}
x - y + 2z &= 2 \\
4x + y - 2z &= 10 \\
x + 3y + z &= 0.
\end{aligned}
\]

4.20 Show that the eigenvectors from the matrix $\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$ are perpendicular.

4.21 A matrix is an M-matrix if $x_{ij} \le 0,\ \forall i \neq j$, and all the elements of the inverse ($X^{-1}$) are nonnegative. Construct an example.

4.22 Williams and Griffin (1964) looked at executive compensation in the following way. An allowable bonus to managers, B, is computed as a percentage of net profit, P, before the bonus and before income taxes, T. But a reciprocal relationship exists because the size of the bonus affects net profit, and vice versa. They give the following example as a system of equations. Solve.

\[
\begin{aligned}
B - 0.10P + 0.10T &= 0 \\
0.50B - 0.50P + T &= 0 \\
P &= 100{,}000.
\end{aligned}
\]

where x.1 is percent change in the money supply a year ago (narrow), x.2 is percent change in the money supply a year ago (broad), x.3 is the 3-month money market rate (latest), x.4 is the 3-month money market rate (1 year ago), x.5 is the 2-year government bond rate, x.6 is the 10-year government bond rate (latest), x.7 is the 10-year government bond rate (1 year ago), and x.8 is the corporate bond rate (source: The Economist, January 29, 2005, page 97). We would expect a number of these figures to be stable over time or to relate across industrialized democracies. Test whether this makes the matrix X′X ill-conditioned by obtaining the condition number. What is the rank of X′X? Calculate the determinant using eigenvalues. Do you expect near collinearity here?

4.24 Show that the inverse relation for the matrix A below is true:

\[
\mathbf{A}^{-1} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \begin{bmatrix} \dfrac{d}{e} & \dfrac{-b}{e} \\[1ex] \dfrac{-c}{e} & \dfrac{a}{e} \end{bmatrix}.
\]

Here e is the determinant of A. Now apply this rule to invert the 2 × 2 matrix X′X from the n × 2 matrix X, which has a leading column of 1’s and a second column vector: $[x_{11}, x_{12}, \ldots, x_{1n}]$.

4.25 Another method for solving linear systems of equations of the form $A^{-1}y = x$ is Cramer’s rule. Define $A_j$ as the matrix where y is plugged in for the jth column of A. Perform this for every column 1, . . . , q to produce q of these matrices, and the solution will be the vector

\[
\left[ \frac{|A_1|}{|A|},\ \frac{|A_2|}{|A|},\ \ldots,\ \frac{|A_q|}{|A|} \right].
\]

Show that performing these steps on the matrix in the example on page 159 gives the same answer.
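The mechanics of Cramer’s rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 2 × 2 system below is hypothetical (it is not the example from page 159), and the helper names `det` and `cramer` are invented here for the sketch.

```python
def det(m):
    """Determinant by cofactor expansion along the first row (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, pivot in enumerate(m[0]):
        # Minor: drop row 0 and column j.
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * pivot * det(minor)
    return total


def cramer(a, y):
    """Solve A x = y by Cramer's rule: x_j = |A_j| / |A|."""
    d = det(a)
    if d == 0:
        raise ValueError("A is singular; Cramer's rule does not apply")
    solution = []
    for j in range(len(a)):
        # A_j: the matrix A with y plugged in for the jth column.
        a_j = [row[:j] + [y[i]] + row[j + 1:] for i, row in enumerate(a)]
        solution.append(det(a_j) / d)
    return solution


# Hypothetical system: 2*x1 + x2 = 3 and x1 + 3*x2 = 5.
A = [[2.0, 1.0], [1.0, 3.0]]
y = [3.0, 5.0]
x = cramer(A, y)
print(x)
```

Cofactor expansion is used here only because it mirrors the hand calculation; it becomes slow for large matrices.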

5
Elementary Scalar Calculus

5.1 Objectives

This chapter introduces the basics of calculus operating on scalar quantities, and most of these principles can be understood quite readily. Many find that the language and imagery of calculus are a lot more intimidating than the actual material (once they commit themselves to studying it). There are two primary tools of calculus, differentiation and integration, both of which are introduced here. A further chapter gives additional details and nuances, as well as an explanation of calculus on nonscalar values like vectors and matrices.

5.2 Limits and Lines

The first important idea on the way to understanding calculus is that of a limit. The key point is to see how functions behave when some value is made arbitrarily small or arbitrarily large on some measure, or arbitrarily close to some finite value. That is, we are specifically interested in how a function tends to or converges to some point or line in the limit.

the domain (support) [0 : 4]. This function is unimodal, possessing a single maximum point, and it is symmetric, meaning that the shape is mirrored on either side of the line through the middle (which is the mode here). This function is graphed in the ﬁgure at the right.


What happens as the function approaches this mode from either direction? Consider “creeping up” on the top of the hill from either direction, as tabulated:

        Left                   Right
    x        f(x)          x        f(x)
  1.8000    2.9600       2.2000    2.9600
  1.9000    2.9900       2.1000    2.9900
  1.9500    2.9975       2.0500    2.9975
  1.9900    2.9999       2.0100    2.9999

It should be obvious from the graph as well as these listed values that the limit of the function as x → 2 is 3, approached from either direction. This is denoted $\lim_{x \to 2} f(x) = 3$ for the general case, as well as $\lim_{x \to 2^-} f(x) = 3$ for approaching from the left and $\lim_{x \to 2^+} f(x) = 3$ for approaching from the right.

The reason that the right-hand limit and the left-hand limit are equal is that the function is continuous at the point of interest. If the function is smooth (continuous at all points; no gaps and no “corners” that cause problems here), then the left-hand limit and the right-hand limit are always identical except for at infinity.

Let us consider a more interesting case. Can a function have a finite limiting value in f(x) as x goes to infinity? The answer is absolutely yes, and this turns out to be an important principle in understanding some functions.

Fig. 5.2. $f(x) = 1 + 1/x^2$

An interesting new function is $f(x) = 1 + 1/x^2$ over the domain (0:+∞). Note that this function’s range is over the positive real numbers greater than one because of the square placed here on x. Again, the function is graphed in the figure at right showing the line at y = 1. What happens as x approaches infinity from the left? Obviously it does not make sense to approach infinity from the right! Consider again the tabulated values:

x       1        2        3        6        12       24       100
f(x)    2.0000   1.2500   1.1111   1.0278   1.0069   1.0017   1.0001

Once again the effect is not subtle. As x gets arbitrarily large, f(x) gets progressively closer to 1. The curve approaches but never seems to reach f(x) = 1 on the graph above. What occurs at exactly ∞ though? Plug ∞ into the function and see what results: f(x) = 1 + 1/∞ = 1 (1/∞ is defined as zero because 1 divided by progressively larger numbers gets progressively smaller and infinity is the largest number). So in the limit (and only in the limit) the function reaches 1, and for every finite value the curve is above the horizontal line at one. We say here that the value 1 is the asymptotic value of the function f(x) as x → ∞ and that the line y = 1 is the asymptote: $\lim_{x \to \infty} f(x) = 1$.

There is another limit of interest for this function. What happens at x = 0? Plugging this value into the function gives f(x) = 1 + 1/0. This produces a result that we cannot use because dividing by zero is not defined, so the function has no allowable value for x = 0 but does have allowable values for every positive x. Therefore the asymptotic value of f(x) with x approaching zero from the right is infinity, which makes the vertical line x = 0 an asymptote of a different kind for this function: $\lim_{x \to 0^+} f(x) = \infty$.
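Both asymptotes of $f(x) = 1 + 1/x^2$ can be checked numerically. The following is a minimal sketch; the particular evaluation points beyond those in the table above are arbitrary choices made for illustration.

```python
def f(x):
    """f(x) = 1 + 1/x^2, defined for x > 0."""
    return 1.0 + 1.0 / x**2


# Horizontal asymptote y = 1: f(x) falls toward 1 as x grows.
large_x = [1, 2, 3, 6, 12, 24, 100, 10_000]
toward_infinity = [(x, f(x)) for x in large_x]

# Vertical asymptote x = 0: f(x) grows without bound as x shrinks toward 0 from the right.
small_x = [1.0, 0.1, 0.01, 0.001]
toward_zero = [(x, f(x)) for x in small_x]

for x, fx in toward_infinity:
    print(f"x = {x:>6}   f(x) = {fx:.4f}")
for x, fx in toward_zero:
    print(f"x = {x:>6}   f(x) = {fx:.1f}")
```

For every finite x the value stays strictly above 1, which is the numerical counterpart of the curve never touching the line y = 1.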

5.3 Understanding Rates, Changes, and Derivatives

So why is it important to spend all that time on limits? We now turn to the definition of a derivative, which is based on a limit. To illustrate the discussion we will use a formal model from sociology that seeks to explain thresholds in voluntary racial segregation. Granovetter and Soong (1988) built on the foundational work of Thomas Schelling by mathematizing the idea that members of a racial group are progressively less likely to remain in a neighborhood as the proportion of another racial group rises. Assuming just blacks and whites, we can define the following terms: x is the percentage of whites, R is the “tolerance” of whites for the ratio of whites to blacks, and $N_w$ is the total number of whites living in the neighborhood. In Granovetter and Soong’s model, the function f(x) defines a mobility frontier whereby an absolute number of blacks above the frontier causes whites to move out and an absolute number of blacks below the frontier causes whites to move in (or stay). They then developed and justified the function:

\[
f(x) = R\left[1 - \frac{x}{N_w}\right]x,
\]

increases sharply moving right from zero, hits a maximum at 125, and then decreases back to zero. This means that the tolerated level was monotonically increasing (constantly increasing or staying the same, i.e., nondecreasing) until the maximum and then monotonically decreasing (constantly decreasing or staying the same, i.e., nonincreasing) until the tolerated level reaches zero.
Fig. 5.3. Describing the Rate of Change (two panels plotted against x; labeled curves: $f(x) = 5(1 - x/100)x$ and $f'(x) = 5 - x/10$)

We are actually interested here in the rate of change of the tolerated number as opposed to the tolerated number itself: The rate of increase steadily declines from an earnest starting point until it reaches zero; then the rate of decrease starts slowly and gradually picks up until the velocity is zero. This can be summarized by the following table (recall that ∈ means “an element of”).

Region            Speed         Rate of Change
x ∈ (0:50]        increasing    decreasing
x = 50            maximum       zero
x ∈ [50:100)      decreasing    increasing

Say that we are interested in the rate of change at exactly x = 20, which is the point designated at coordinates (20, 80) in the first panel of Figure 5.3. How would we calculate this? A reasonable approximation can be made with line segments. Specifically, starting 4 units away from 20 in either direction, go 1 unit along the x-axis toward the point at 20 and construct line segments connecting the points along the curve at these x levels. The slope of the line segment (easily calculated from Section 1.5.1) is therefore an approximation to the instantaneous rate at x = 20, “rise-over-run,” given by the segment

\[
m = \frac{f(x_2) - f(x_1)}{x_2 - x_1}.
\]

So the first line segment of interest has values $x_1 = 16$ and $x_2 = 24$. If we call the width of the interval $h = x_2 - x_1$, then the point of interest, x, is at the center of this interval and we say

\[
m = \frac{f\left(x + \frac{h}{2}\right) - f\left(x - \frac{h}{2}\right)}{h} = \frac{f(x + h) - f(x)}{h}
\]

because f(h/2) can move between functions in the numerator. This segment is shown as the lowest (longest) line segment in the second panel of Figure 5.3 and has slope 2.6625. In fact, this estimate is not quite right, but it is an average of a slightly faster rate of change (below) and a slightly slower rate of change (above). Because this is an estimate, it is reasonable to ask how we can improve it. The obvious idea is to decrease the width of the interval around the point of interest. First go to 17–23 and then 18–22, and construct new line segments and therefore new estimates as shown in the second panel of Figure 5.3. At each reduction in interval width we are improving the estimate of the instantaneous rate of change at x = 20. Notice the nonlinear scale on the y-axis produced by the curvature of the function.

When should we stop? The answer to this question is found back in the previous discussion of limits. Define h again as the length of the intervals created as just described and call the expression for the slope of the line segment m(x), to distinguish the slope form from the function itself. The point where $\lim_{h \to 0}$ occurs is the point where we get exactly the instantaneous rate of change at x = 20 since the width of the interval is now zero, yet it is still “centered” around (20, 80). This instantaneous rate is equal to the slope of the tangent line (not to be confused with the tangent trigonometric function from Chapter 2) to the curve at the point x: the line that touches the curve only at this one point. It can be shown that there exists a unique tangent line for every point on a smooth curve. So let us apply this logic to our function and perform the algebra very mechanically:
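\[
\begin{aligned}
\lim_{h \to 0} m(x) &= \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \\
&= \lim_{h \to 0} \frac{5(x + h) - \frac{1}{20}(x + h)^2 - \left(5x - \frac{1}{20}x^2\right)}{h} \\
&= \lim_{h \to 0} \frac{5h - \frac{1}{20}(x^2 + 2xh + h^2) + \frac{1}{20}x^2}{h} \\
&= \lim_{h \to 0} \frac{5h - \frac{2}{20}xh - \frac{1}{20}h^2}{h} \\
&= \lim_{h \to 0} \left(5 - \frac{1}{10}x - \frac{1}{20}h\right) \\
&= 5 - \frac{1}{10}x.
\end{aligned}
\]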
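The same limit can be watched numerically. The sketch below evaluates the difference quotient $(f(x+h) - f(x))/h$ from the derivation at x = 20 for shrinking h; the list of widths is an arbitrary choice for illustration, and the function names are invented here.

```python
def f(x):
    """The tolerance function with R = 5 and N_w = 100: f(x) = 5(1 - x/100)x."""
    return 5.0 * (1.0 - x / 100.0) * x


def secant_slope(x, h):
    """Slope of the line segment from (x, f(x)) to (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h


def f_prime(x):
    """The limiting slope obtained algebraically: 5 - x/10."""
    return 5.0 - x / 10.0


x0 = 20.0
widths = [4.0, 2.0, 1.0, 0.1, 0.001]  # shrinking interval widths h
slopes = [secant_slope(x0, h) for h in widths]

for h, m in zip(widths, slopes):
    print(f"h = {h:<6}  slope = {m:.5f}")
print(f"limit    slope = {f_prime(x0):.5f}")
```

By the algebra above each slope equals $5 - x/10 - h/20$, so the printed values close in on the limiting slope as h shrinks.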

This means that for any allowable x point we now have an expression for the instantaneous slope at that point. Label this with a prime to clarify that it is a different, but related, function: $f'(x) = 5 - \frac{1}{10}x$. Our point of interest is 20, so $f'(20) = 3$. Figure 5.4 shows tangent lines plotted at various points on f(x). Note that the tangent line at the maximum is “flat,” having slope zero. This is an important principle that we will make extensive use of later. What we have done here is produce the derivative of the function f(x), denoted f′(x), also called differentiating f(x). This derivative process is fundamental and has the definition

\[
f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}.
\]

The derivative expression f′(x) is Euler’s version of Newton’s notation, but it is often better to use Leibniz’s notation $\frac{d}{dx}f(x)$, which resembles the limit derivation we just performed, substituting Δx = h. The change (delta) in x is therefore

\[
\frac{d}{dx}f(x) = \frac{df(x)}{dx} = \lim_{\Delta x \to 0} \frac{\Delta f(x)}{\Delta x}.
\]

This latter notation for the derivative is generally preferred because it better reflects the change in the function f(x) for an infinitesimal change in x, and it is easier to manipulate in more complex problems. Also, note that the fractional form of Leibniz’s notation is given in two different ways, which are absolutely equivalent:

\[
\frac{d}{dx}u = \frac{du}{dx},
\]

for some function u = f(x). Having said all that, Newton’s form is more compact and looks nicer in simple problems, so it is important to know each form because they are both useful.

To summarize what we have done so far:

Summary of Derivative Theory

Existence       f′(x) at x exists iff f(x) is continuous at x, and there is no point where the right-hand derivative and the left-hand derivative are different
Definition      $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$
Tangent Line    f′(x) is the slope of the line tangent to f() at x; this is the limit of the enclosed secant lines

The second existence condition needs further explanation. This is sometimes called the “no corners” condition because these points are geometric corners of the function and have the condition that

\[
\lim_{\Delta x \to 0^-} \frac{\Delta f(x)}{\Delta x} \neq \lim_{\Delta x \to 0^+} \frac{\Delta f(x)}{\Delta x}.
\]

That is, taking these limits to the left and to the right of the point produces different results. The classic example is the function f(x) = |x|, which looks like a “V” centered at the origin. So infinitesimally approaching (0, 0) from the left, $\Delta x \to 0^-$, is different from infinitesimally approaching (0, 0) from the right, $\Delta x \to 0^+$. Another way to think about this is related to Figure 5.4. Each of the illustrated tangent lines is uniquely determined by the selected point along the function. At a corner point the respective line would be allowed to “swing around” to an infinite number of places because it is resting on a single point (atom). Thus no unique derivative can be specified.

Example 5.5: Derivatives for Analyzing Legislative Committee Size.

Francis (1982) wanted to ﬁnd criteria for determining “optimal” committee sizes in Congress or state-level legislatures. This is an important question because committees are key organizational and procedural components of American legislatures. A great number of scholars of American politics


have observed the central role that committee government plays, but not nearly as many have sought to understand committee size and subsequent efficiency. Efficiency is defined by Francis as minimizing two criteria for committee members:

• Decision Costs: ($Y_d$) the time and energy required for obtaining policy information, bargaining with colleagues, and actual meeting time.
• External Costs: ($Y_e$) the electoral and institutional costs of producing nonconsensual committee decisions (i.e., conflict).

Francis modeled these costs as a function of committee size in the following partly-linear specifications:

\[
\begin{aligned}
Y_d &= a_d + b_d X^g, & g &> 0 \\
Y_e &= a_e + b_e X^k, & k &< 0,
\end{aligned}
\]

where the a terms are intercepts (i.e., the costs for nonexistence or nonmembership), the b terms are slopes (the relative importance of the multiplied term X) in the linear sense discussed in Section 1.5.1, and X is the size of the committee. The interesting terms here are the exponents on X. Since g is necessarily positive, increasing committee size increases decision costs, which makes logical sense. The term k is restricted to be negative, implying that the larger the committee, the closer the representation on the committee is to the full chamber or the electorate and therefore the lower such outside costs of committee decisions are likely to be to a member. The key point of the Francis model is that committee size is a trade-off between $Y_d$ and $Y_e$ since they move in opposite directions for increasing or decreasing numbers of members. This can be expressed in one single equation by asserting that the two costs are treated equally for members (an assumption that could easily be generalized by weighting the two functions differently) and adding the two cost equations:

\[
Y = Y_d + Y_e = a_d + b_d X^g + a_e + b_e X^k.
\]


So now we have a single equation that expresses the full identified costs to members as a function of X. How would we understand the effect of changing the committee size? Taking the derivative of this expression with respect to X gives the instantaneous rate of change in these political costs at levels of X, and understanding changes in this rate helps to identify better or worse committee sizes for known or estimated values of $a_d$, $b_d$, g, $a_e$, $b_e$, and k. The derivative operating on a polynomial does the following: It multiplies the expression by the exponent, it decreases the exponent by one, and it gets rid of any isolated constant terms. These rules are further explained and illustrated in the next section. The consequence in this case is that the first derivative of the Francis model is

\[
\frac{d}{dX}Y = g b_d X^{g-1} + k b_e X^{k-1},
\]

which allowed him to see the instantaneous effect of changes in committee size and to see where important regions are located. Francis thus found a minimum point by looking for an X value where $\frac{d}{dX}Y$ is equal to zero (i.e., the tangent line is flat). This value of X (subject to one more qualification to make sure that it is not a maximum) minimizes costs as a function of committee size given the known parameters.
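To make the search for a flat tangent line concrete, the sketch below evaluates the Francis cost function and its derivative for a set of hypothetical parameter values (they are invented for illustration, not taken from Francis). Setting $g b_d X^{g-1} + k b_e X^{k-1} = 0$ and solving for X > 0 gives $X^{*} = \left(-k b_e / (g b_d)\right)^{1/(g-k)}$, which assumes positive slopes $b_d$ and $b_e$.

```python
def total_cost(x, a_d, b_d, g, a_e, b_e, k):
    """Y = a_d + b_d*X^g + a_e + b_e*X^k."""
    return a_d + b_d * x**g + a_e + b_e * x**k


def marginal_cost(x, b_d, g, b_e, k):
    """dY/dX = g*b_d*X^(g-1) + k*b_e*X^(k-1); the constants a_d and a_e drop out."""
    return g * b_d * x**(g - 1) + k * b_e * x**(k - 1)


def stationary_size(b_d, g, b_e, k):
    """Committee size where dY/dX = 0, assuming b_d > 0, b_e > 0, g > 0, k < 0."""
    return (-k * b_e / (g * b_d)) ** (1.0 / (g - k))


# Hypothetical parameter values chosen only to give round numbers.
params = dict(a_d=1.0, b_d=1.0, g=1.0, a_e=2.0, b_e=16.0, k=-1.0)

x_star = stationary_size(params["b_d"], params["g"], params["b_e"], params["k"])
slope_at_star = marginal_cost(x_star, params["b_d"], params["g"], params["b_e"], params["k"])

print("stationary committee size:", x_star)
print("slope there:", slope_at_star)
for size in (3.0, x_star, 5.0):
    print(size, total_cost(size, **params))
```

Comparing the cost at the stationary point with the cost on either side is a crude stand-in for the qualification mentioned above, namely checking that the flat point is a minimum rather than a maximum.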

5.4 Derivative Rules for Common Functions

It would be pretty annoying if every time we wanted to obtain the derivative of a function we had to calculate it through the limit as was done in the last section. Fortunately there are a number of basic rules such that taking derivatives on polynomials and other functional forms is relatively easy.

5.4.1 Basic Algebraic Rules for Derivatives

We will provide a number of basic rules here without proof. Most of them are fairly intuitive.

This is a good point in the discussion to pause and make sure that the six examples above are well understood by the reader as the power rule is perhaps the most fundamental derivative operation. The derivative of a constant is always zero:

\[
\frac{d}{dx}k = 0, \quad \forall k.
\]

This makes sense because a derivative is a rate of change and constants do not change. For example,

\[
\frac{d}{dx}2 = 0
\]

(i.e., there is no change to account for in 2). However, when a constant is multiplying some function of x, it is immaterial to the derivative operation, but it has to be accounted for later:

\[
\frac{d}{dx}kf(x) = k\frac{d}{dx}f(x).
\]

As an example, the derivative of f(x) = 3x is simply 3 since $\frac{d}{dx}x$ is 1.

The derivative of a sum is just the sum of the derivatives, provided that each component of the sum has a defined derivative:

\[
\frac{d}{dx}\left[f(x) + g(x)\right] = \frac{d}{dx}f(x) + \frac{d}{dx}g(x),
\]

and of course this rule is not limited to just two components in the sum:

\[
\frac{d}{dx}\sum_{i=1}^{k} f_i(x) = \frac{d}{dx}f_1(x) + \frac{d}{dx}f_2(x) + \cdots + \frac{d}{dx}f_k(x).
\]

Unfortunately, the product rule is a bit more intricate than these simple methods. The product rule is the sum of two pieces where in each piece one of the two multiplied functions is left alone and the other is differentiated:

\[
\frac{d}{dx}\left[f(x)g(x)\right] = f(x)\frac{d}{dx}g(x) + g(x)\frac{d}{dx}f(x).
\]

This is actually not a very difficult formula to remember due to its symmetry. As an example, we now differentiate the following:

\[
\frac{d}{dx}(3x^2 - 3)(4x^3 - 2x) = (3x^2 - 3)\frac{d}{dx}(4x^3 - 2x) + (4x^3 - 2x)\frac{d}{dx}(3x^2 - 3)
\]
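The product rule for this pair of polynomials can be checked against a purely numerical slope. In the sketch below the component derivatives 6x and 12x² − 2 come from the power rule, while the evaluation point x = 1.5 and the step size are arbitrary choices for illustration.

```python
def f(x):
    return 3 * x**2 - 3


def g(x):
    return 4 * x**3 - 2 * x


def f_prime(x):
    """Power rule applied to 3x^2 - 3."""
    return 6 * x


def g_prime(x):
    """Power rule applied to 4x^3 - 2x."""
    return 12 * x**2 - 2


def product_rule(x):
    """f(x) g'(x) + g(x) f'(x)."""
    return f(x) * g_prime(x) + g(x) * f_prime(x)


def numeric_derivative(func, x, h=1e-6):
    """Symmetric difference quotient, used only as an independent check."""
    return (func(x + h) - func(x - h)) / (2 * h)


x0 = 1.5
exact = product_rule(x0)
approx = numeric_derivative(lambda x: f(x) * g(x), x0)
print(exact, approx)
```

The two numbers should agree to several decimal places; any visible gap would signal a mistake in one of the component derivatives.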

But since quotients are products where one of the terms is raised to the −1 power, it is generally easier to remember and easier to execute the product rule with this adjustment:

\[
\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{d}{dx}\left[f(x)g(x)^{-1}\right] = f(x)\frac{d}{dx}g(x)^{-1} + g(x)^{-1}\frac{d}{dx}f(x).
\]

This would be fine, but we do not yet know how to calculate $\frac{d}{dx}g(x)^{-1}$ in general since there are nested components that are functions of x: $g(x)^{-1} = (4x^3 - 2x)^{-1}$. Such a calculation requires the chain rule.

The chain rule provides a means of differentiating nested functions. In Chapter 1, on page 20, we saw nested functions of the form f ◦ g = f(g(x)). The case of $g(x)^{-1} = (4x^3 - 2x)^{-1}$ fits this categorization because the inner function is $g(x) = 4x^3 - 2x$ and the outer function is $f(u) = u^{-1}$. Typically u is used as a placeholder here to make the point that there is a distinct subfunction. To correctly differentiate such a nested function, we have to account for the actual order of the nesting relationship. This is done by the chain rule, which is given by

\[
\frac{d}{dx}f(g(x)) = f'(g(x))\,g'(x),
\]

provided of course that f(x) and g(x) are both differentiable functions. We can also express this in the other standard notation. If y = f(u) and u = g(x) are both differentiable functions, then

\[
\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx},
\]

which may better show the point of the operation. If we think about this in purely fractional terms, it is clear that du cancels out of the right-hand side, making the equality obvious. Let us use this new tool to calculate the derivative of the function $g(x)^{-1}$ from above ($g(x) = 4x^3 - 2x$):
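A sketch of that calculation, using only the chain rule and the power rule as stated in this section, with u as the placeholder for the inner function:

```latex
\begin{align*}
\frac{d}{dx}\,g(x)^{-1} &= \frac{d}{dx}\,(4x^3 - 2x)^{-1} \\
  &= \frac{d}{du}\,u^{-1}\cdot\frac{du}{dx}, \qquad u = 4x^3 - 2x \\
  &= (-1)\,u^{-2}\,(12x^2 - 2) \\
  &= -\frac{12x^2 - 2}{(4x^3 - 2x)^2}.
\end{align*}
```

The outer derivative is taken with respect to u first and the inner derivative is multiplied on afterward, which is the order of the nesting that the chain rule is meant to respect.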

So we see that applying the chain rule ends up being quite mechanical. In many cases one is applying the chain rule along with other derivative rules in the context of the same problem. For example, suppose we want the derivative

Note that another application of the chain rule was required with the inner step because of the term $(y^2 + 1)^{-1}$. While this is modestly inconvenient, it comes from the more efficient use of the product rule with the second term raised to the −1 power rather than the use of the quotient rule.

Example 5.6: Productivity and Interdependence in Work Groups.

We might ask, how does the productivity of workers affect total organizational productivity differently in environments where there is a great amount of interdependence of tasks relative to an environment where tasks are processed independently? Stinchcombe and Harris (1969) developed a mathematical model to explain such productivity differences and the subsequent effect of greater supervision. Not surprisingly, they found that the effect of additional supervision is greater for work groups that are more interdependent in processing tasks. Deﬁne T1 as the total production when each person’s task is independent and T2 as the total production when every task is interdependent. Admittedly, these are extreme cases, but the point is to show differences, so they are likely to be maximally revealing.

In the independent case, the total organizational productivity is

\[
T_1 = \sum_{j=1}^{n} b p_j = nb\bar{p},
\]

where $p_j$ is the jth worker’s performance, b is an efficiency constant for the entire organization that measures how individual performance contributes to total performance, and there are n workers. The notation $\bar{p}$ indicates the average (mean) of all the $p_j$ values. In the interdependent case, we get instead

\[
T_2 = K\prod_{j=1}^{n} p_j,
\]

where K is the total rate of production when everyone is productive. What the product notation (explained on page 12) shows here is that if even one worker is totally unproductive, $p_j = 0$, then the entire organization is unproductive. Also, productivity is a function of willingness to use ability at each worker’s level, so we can define a function $p_j = f(x_j)$ that relates productivity to ability. The premise is that supervision affects this function by motivating people to use their abilities to the greatest extent possible. Therefore we are interested in comparing

\[
\frac{\partial T_1}{\partial x_j} \quad \text{and} \quad \frac{\partial T_2}{\partial x_j}.
\]

But we cannot take this derivative directly since T is a function of p and p is a function of x. Fortunately the chain rule sorts this out for us:

\[
\begin{aligned}
\frac{\partial T_1}{\partial x_j} &= \frac{\partial T_1}{\partial p_j}\frac{\partial p_j}{\partial x_j} = \frac{T_1}{n\bar{p}}\frac{\partial p_j}{\partial x_j} \\
\frac{\partial T_2}{\partial x_j} &= \frac{\partial T_2}{\partial p_j}\frac{\partial p_j}{\partial x_j} = \frac{T_2}{p_j}\frac{\partial p_j}{\partial x_j}.
\end{aligned}
\]

The independent case is simple because $b = T_1/n\bar{p}$ is how much any one of the individuals contributes through performance, and therefore how much the jth worker contributes. The interdependent case comes from dividing out the jth productivity $p_j$ from the total product. The key for comparison was

that the second fraction in both expressions ($\partial p_j/\partial x_j$) is the same, so the effect of different organizations was completely testable as the first fraction only. Stinchcombe and Harris’ subsequent claim was that “the marginal productivity of the jth worker’s performance is about n times as great in the interdependent case but varies according to his absolute level of performance” since the marginal for the interdependent case was dependent on a single worker and the marginal for the independent case was dependent only on the average worker. So those with very low performance were more harmful in the interdependent case than in the independent case, and these are the cases that should be addressed first.
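The two first fractions can be compared directly with a few made-up numbers. In this sketch the performance levels in `p` and the constants `b` and `K` are hypothetical values chosen for illustration, and the function names are invented here.

```python
from math import prod


def marginal_independent(b):
    """dT1/dp_j for T1 = sum_j b*p_j: the constant b, which equals T1/(n*pbar)."""
    return b


def marginal_interdependent(K, p, j):
    """dT2/dp_j for T2 = K * prod_j p_j: K times the product of the other workers' p."""
    others = p[:j] + p[j + 1:]
    return K * prod(others)


# Hypothetical performance levels and constants.
p = [0.9, 0.8, 0.5]
b = 10.0
K = 30.0

n = len(p)
p_bar = sum(p) / n
T1 = sum(b * pj for pj in p)
T2 = K * prod(p)

for j, pj in enumerate(p):
    print(
        f"worker {j}: p_j = {pj}",
        f"independent marginal = {marginal_independent(b):.2f}",
        f"interdependent marginal = {marginal_interdependent(K, p, j):.2f}",
    )
```

With these numbers the interdependent marginal is largest for the weakest worker, which is the pattern behind the claim that low performers matter most when tasks are interdependent.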

5.4.2 Derivatives of Logarithms and Exponents

We have already seen the use of real numbers as exponents in the derivative process. What if the variable itself is in the exponent? For this form we have

\[
\frac{d}{dx}n^x = \log(n)\,n^x \quad \text{and} \quad \frac{d}{dx}n^{f(x)} = \log(n)\,n^{f(x)}\frac{df(x)}{dx},
\]

where the log function is the natural log (denoted logn() or ln()). So the derivative “peels off” a value of n that is “logged.” This is especially handy when n is the e value, since

\[
\frac{d}{dx}e^x = \log(e)\,e^x = e^x,
\]

meaning that $e^x$ is invariant to the derivative operation. But for compound functions with u = f(x), we need to account for the chain rule:

\[
\frac{d}{dx}e^u = e^u\frac{du}{dx},
\]

for u a function of x. Relatedly, the derivatives of the logarithm are given by

\[
\frac{d}{dx}\log(x) = \frac{1}{x},
\]

That was pretty simple actually, so now perform a more complicated calculation. The procedure logarithmic differentiation uses the log function to make the process of differentiating a difficult function easier. The basic idea is to log the function, take the derivative of this version, and then compensate back at the last step. Start with the function:

\[
y = \frac{(3x^2 - 3)^{\frac{1}{3}}(4x - 2)^{\frac{1}{4}}}{(x^2 + 1)^{\frac{1}{2}}},
\]

which would be quite involved using the product rule, the quotient rule, and the chain rule. So instead, let us take the derivative of

\[
\log(y) = \frac{1}{3}\log(3x^2 - 3) + \frac{1}{4}\log(4x - 2) - \frac{1}{2}\log(x^2 + 1)
\]

(note the minus sign in front of the last term because it was in the denominator). Now we take the derivative and solve on this easier metric using the additive property of derivation rather than the product rule and the quotient rule. The

(1994) evaluated the decisions that nations make in seeking security through building their armed forces and seeking alliances with other nations. A standard theory in the international relations literature asserts that nations (states) form alliances predominantly to protect themselves from threatening states (Walt 1987, 1988). Thus they rely on their own armed services as well as the armed services of other allied nations as a deterrence from war. However, as Sorokin pointed out, both arms and alliances are costly, and so states will seek a balance that maximizes the security beneﬁt from necessarily limited resources. How can this be modeled? Consider a state labeled i and its erstwhile ally labeled j. They each have military capability labeled Mi and Mj , correspondingly. This is a convenient simpliﬁcation that helps to construct an illustrative model, and it includes such factors as the numbers of soldiers, quantity and quality of military hardware, as well as geographic constraints.


It would be unreasonable to say that just because i had an alliance with j it could automatically count on receiving the full level of $M_j$ support if attacked. Sorokin thus introduced the term T ∈ [0:1], which indicates the “tightness” of the alliance, where higher values imply a higher probability of country j providing $M_j$ military support or the proportion of $M_j$ to be provided. So T = 0 indicates no military alliances whatsoever, and values very close to 1 indicate a very tight military alliance such as the heyday of NATO and the Warsaw Pact. The variable of primary interest is the amount of security that nation i receives from the combination of its military capability and the ally’s capability weighted by the tightness of the alliance. This term is labeled $S_i$ and is defined as

\[
S_i = \log(M_i + 1) + T\log(M_j + 1).
\]

The logarithm is specified because increasing levels of military capability are assumed to give diminishing levels of security as capabilities rise, and the 1 term gives a baseline. So if T = 0.5, then one unit of $M_i$ is equivalent to two units of $M_j$ in security terms. But rather than simply list out hypothetical levels for substantive analysis, it would be more revealing to obtain the marginal effects of each variable, which are the individual contributions of each term. There are three quantities of interest, and we can obtain marginal effect equations for each by taking three individual first derivatives that provide the instantaneous rate of change in security at chosen levels. Because we have three variables to keep track of, we will use slightly different notation in taking first derivatives. The partial derivative notation replaces “d” with “∂” but performs exactly the same operation. The replacement is just to remind us that there are other random quantities in the equation and we have picked just one of them to differentiate with this particular expression (more on this in Chapter 6). The three marginal effects

What can we learn from this? The marginal effects of Mi and Mj are declining with increases in level, meaning that the rate of increase in security decreases. This shows that adding more men and arms has a diminishing effect, but this is exactly the motivation for seeking a mixture of arms under national command and arms from an ally since limited resources will then necessarily leverage more security. Note also that the marginal effect of Mj includes the term T . This means that this marginal effect is deﬁned only at levels of tightness, which makes intuitive sense as well. Of course the reverse is also true since the marginal effect of T depends as well on the military capability of the ally.
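These patterns can be inspected numerically. The sketch below applies the logarithm rule $\frac{d}{dx}\log(x) = 1/x$ (with the chain rule) to $S_i = \log(M_i + 1) + T\log(M_j + 1)$, which gives $1/(M_i + 1)$, $T/(M_j + 1)$, and $\log(M_j + 1)$ as the partial derivatives with respect to $M_i$, $M_j$, and T. The capability and tightness values are hypothetical, and the function names are invented for this illustration.

```python
from math import log


def security(m_i, m_j, t):
    """S_i = log(M_i + 1) + T*log(M_j + 1)."""
    return log(m_i + 1.0) + t * log(m_j + 1.0)


def marginal_effects(m_i, m_j, t):
    """Partial derivatives of S_i with respect to M_i, M_j, and T."""
    return {
        "M_i": 1.0 / (m_i + 1.0),
        "M_j": t / (m_j + 1.0),
        "T": log(m_j + 1.0),
    }


# Hypothetical levels: own capability 9, ally capability 4, tightness 0.5.
effects = marginal_effects(9.0, 4.0, 0.5)
print(effects)

# Diminishing returns: the marginal effect of M_i shrinks as M_i grows.
for m_i in (1.0, 9.0, 99.0):
    print(m_i, marginal_effects(m_i, 4.0, 0.5)["M_i"])
```

The marginal effect of the ally’s capability scales with T, and the marginal effect of T grows with the ally’s capability, matching the two dependencies described above.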

5.4.3 L’Hospital’s Rule

The early Greeks were wary of zero and the Pythagoreans outlawed it. Zero causes problems. In fact, there have been times when zero was considered an “evil” number (and ironically other times when it was considered proof of the existence of god). One problem, already mentioned, caused by zero is when it ends up in the denominator of a fraction. In this case we say that the fraction is “undefined,” which sounds like a nonanswer or some kind of a dodge. A certain conundrum in particular is the case of 0/0. The seventh-century Indian mathematician Brahmagupta claimed it was zero, but his mathematical heirs, such as Bhaskara in the twelfth century, believed that 1/0 must be infinite and yet it would be only one unit away from 0/0 = 0, thus producing a paradox.

Fortunately for us calculus provides a means of evaluating the special case of 0/0. Assume that f(x) and g(x) are differentiable functions at a where f(a) = 0 and g(a) = 0. L’Hospital’s rule states that

$$\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)},$$

provided that $g'(x) \neq 0$. In plainer words, the limit of the ratio of the two functions is equal to the limit of the ratio of their two derivatives. Thus, even if the original ratio is not interpretable, we can often get a result from the ratio of the derivatives. Guillaume L’Hospital was a wealthy French aristocrat who studied under Johann Bernoulli and subsequently wrote the world’s first calculus textbook using correspondence from Bernoulli. L’Hospital’s rule is thus misnamed for its disseminator rather than its creator. As an example, we can evaluate the following ratio, which produces 0/0 at the point 0:
$$\lim_{x \to 0} \frac{x}{\log(1-x)} = \lim_{x \to 0} \frac{\frac{d}{dx}x}{\frac{d}{dx}\log(1-x)} = \lim_{x \to 0} \frac{1}{-\frac{1}{1-x}} = -1.$$
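A quick numerical check of this limit is sketched below. The function names and the evaluation points are illustrative assumptions; the code simply evaluates the original ratio near zero and compares it with the ratio of derivatives.

```python
import math

def ratio(x):
    """The original ratio x / log(1 - x), which is undefined at x = 0."""
    return x / math.log1p(-x)  # log1p(-x) computes log(1 - x) accurately near 0

def derivative_ratio(x):
    """Ratio of the derivatives: 1 / (-1 / (1 - x)) = -(1 - x)."""
    return -(1.0 - x)

if __name__ == "__main__":
    for x in (0.1, 0.01, 0.001, -0.001):
        print(x, ratio(x), derivative_ratio(x))
```

Both columns approach −1 as x shrinks toward zero from either side, even though the original ratio cannot be evaluated at zero itself.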

L’Hospital’s rule can also be applied for the form ∞/∞: Assume that f(x) and g(x) are differentiable functions at a where f(a) = ∞ and g(a) = ∞; then again

$$\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}.$$

Here is an example where this is handy. Note the repeated use of the product rule and the chain rule in this calculation:
$$\lim_{x \to \infty} \frac{(\log(x))^2}{x^2 \log(x)} = \lim_{x \to \infty} \frac{\frac{d}{dx}(\log(x))^2}{\frac{d}{dx} x^2 \log(x)} = \lim_{x \to \infty} \frac{2\log(x)\frac{1}{x}}{2x\log(x) + x^2\frac{1}{x}} = \lim_{x \to \infty} \frac{\log(x)}{x^2\log(x) + \frac{1}{2}x^2}.$$

It seems like we are stuck here, but we can actually apply L’Hospital’s rule again, so after the derivatives we have

$$= \lim_{x \to \infty} \frac{\frac{1}{x}}{2x\log(x) + x^2\frac{1}{x} + x} = \lim_{x \to \infty} \frac{1}{2x^2(\log(x) + 1)} = 0.$$

Example 5.8: Analyzing an Infinite Series for Sociology Data. Peterson (1991) wrote critically about sources of bias in models that describe durations: how long some observed phenomenon lasts [also called hazard models or event-history models; see Box-Steffensmeier and Jones (2004) for a review]. In his appendix he claimed that the series defined by

$$a_{j,i} = j_i \times \exp(-\alpha j_i), \qquad \alpha > 0, \quad j_i = 1, 2, 3, \ldots,$$

goes to zero in the limit as $j_i$ continues counting to infinity. His evidence is the application of L’Hospital’s rule twice:

$$\lim_{j_i \to \infty} \frac{j_i}{\exp(\alpha j_i)} = \lim_{j_i \to \infty} \frac{1}{\alpha \exp(\alpha j_i)} = \lim_{j_i \to \infty} \frac{0}{\alpha^2 \exp(\alpha j_i)}.$$

Did we need the second application of L’Hospital’s rule? It appears not, because after the ﬁrst iteration we have a constant in the numerator and positive values of the increasing term in the denominator. Nonetheless, it is no less true and pretty obvious after the second iteration.
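The claimed convergence is easy to inspect numerically. In this sketch the function name and the value α = 0.5 are assumptions made for illustration; any positive α gives the same qualitative pattern.

```python
import math

def series_term(j, alpha):
    """One term of the series: j * exp(-alpha * j), for alpha > 0."""
    return j * math.exp(-alpha * j)

if __name__ == "__main__":
    alpha = 0.5  # hypothetical value; the result only requires alpha > 0
    for j in (1, 2, 3, 10, 50, 200):
        print(j, series_term(j, alpha))
```

The terms rise briefly (peaking near j = 1/α) and then collapse toward zero, because the exponential in the denominator eventually dominates the linear numerator.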

5.4.4 Applications: Rolle’s Theorem and the Mean Value Theorem

There are some interesting consequences for considering derivatives of functions that are over bounded regions of the x-axis. These are stated and explained here without proof because they are standard results.

Rolle’s Theorem:

• Assume a function f(x) that is continuous on the closed interval [a:b] and differentiable on the open interval (a:b). Note that it would be unreasonable to require differentiability at the endpoints.
• f(a) = 0 and f(b) = 0.
• Then there is guaranteed to be at least one point $\hat{x}$ in (a:b) such that $f'(\hat{x}) = 0$.

Think about what this theorem is saying. A point with a zero derivative is a minimum or a maximum (the tangent line is flat), so the theorem is saying that if the endpoints of the interval are both on the x-axis, then there must be one or more points that are modes or anti-modes. Is this logical? Start at the point [a, 0]. Suppose from there the function increased. To get back to the required endpoint at [b, 0] it would have to “turn around” somewhere above the x-axis, thus guaranteeing a maximum in the interval. Suppose instead that the function left [a, 0] and decreased. Then to get back to [b, 0] it would also have to turn around somewhere below the x-axis, now guaranteeing a minimum. There is one more case that is pathological (mathematicians love reminding people about these). Suppose that the function was just a flat line from [a, 0] to [b, 0]. Then every point is a maximum and Rolle’s Theorem is still true. Now we have exhausted the possibilities since the function leaving either endpoint has to either increase, decrease, or stay the same. So we have just provided an informal proof! Also, we have stated this theorem for f(a) = 0 and f(b) = 0, but it is really more general and can be restated for f(a) = f(b) = k, with any constant k.

Mean Value Theorem:

• Assume a function f(x) that is continuous on the closed interval [a:b] and differentiable on the open interval (a:b).
• There is now guaranteed to be at least one point $\hat{x}$ in (a:b) such that $f(b) - f(a) = f'(\hat{x})(b - a)$.

This theorem just says that between the function values at the start and finish of the interval there will be an “average” point. Another way to think about this is to rearrange the result as

$$\frac{f(b) - f(a)}{b - a} = f'(\hat{x})$$

so that the left-hand side gives a slope equation, rise-over-run. This says that the line that connects the endpoints of the function has a slope that is equal to the derivative somewhere in between. When stated this way, we can see that it comes from Rolle’s Theorem where f(a) = f(b) = 0.

Both of these theorems show that the derivative is a fundamental procedure for understanding polynomials and other functions. Remarkably, derivative calculus is a relatively recent development in the history of mathematics, which is of course a very long history. While there were glimmers of differentiation and integration prior to the seventeenth century, it was not until Newton, and independently Leibniz, codified and integrated these ideas that calculus was born. This event represents a dramatic turning point in mathematics, and perhaps in human civilization as well, as it led to an explosion of knowledge and understanding. In fact, much of the mathematics of the eighteenth and early nineteenth centuries was devoted to understanding the details and implications of this new and exciting tool. We have thus far visited one-half of the world of calculus by looking at derivatives; we now turn our attention to the other half, which is integral calculus.

5.5 Understanding Areas, Slices, and Integrals

One of the fundamental mathematics problems is to find the area “under” a curve, designated by R. By this we mean the area below the curve given by a smooth, bounded function, f(x), and above the x-axis (i.e., f(x) ≥ 0, ∀x ∈ [a:b]). This is illustrated in Figure 5.5. Actually, this characterization is a bit too restrictive because other areas in the coordinate axis can also be measured and we will want to treat unbounded or discontinuous areas as well, but we will stick with this setup for now. Integration is a calculus procedure for measuring areas and is as fundamental a process as differentiation.

5.5.1 Riemann Integrals

So how would we go about measuring such an area? Here is a really mechanical and fundamental way. First “slice up” the area under the curve with a set of bars that are approximately as high as the curve at different places. This would then be somewhat like a histogram approximation of R where we simply sum up the sizes of the set of rectangles (a very easy task). This method is sometimes referred to as the rectangle rule but is formally called Riemann integration. It is the simplest but least accurate method for numerical integration. More formally, define n disjoint intervals along the x-axis of width h = (b − a)/n so that the lowest edge is $x_0 = a$, the highest edge is $x_n = b$, and for i = 1, . . . , n − 1, $x_i = a + ih$. This produces a histogram-like approximation of R. The key point is that for the ith bar the approximation of f(x) over h is f(a + ih). The only wrinkle here is that one must select whether to employ “left” or “right” Riemann integration:

$$h \sum_{i=0}^{n-1} f(a + ih) \quad \text{(left Riemann integral)}, \qquad h \sum_{i=1}^{n} f(a + ih) \quad \text{(right Riemann integral)},$$

determining which of the top corners of the bars touches the curve. Despite the obvious roughness of approximating a smooth curve with a series of rectangular bars over regular bins, Riemann integrals can be extremely useful as a crude starting point because they are easily implemented. Figure 5.5 shows this process for both left and right types with the different indexing strategies for i along the x-axis for the function:

$$p(\theta) = \begin{cases} (6 - \theta)^2/200 + 0.011 & \text{for } \theta \in [0:6) \\ C(11, 2)/2 & \text{for } \theta \in [6:12], \end{cases}$$

It is evident from the two graphs that when the function is downsloping, as it is on the left-hand side, the left Riemann integral overestimates and the right Riemann integral underestimates. Conversely when the function is upsloping, as it is toward the right-hand-side, the left Riemann integral underestimates and the right Riemann integral overestimates. For the values given, the left Riemann integral is too large because there is more downsloping in the bounded region, and the right Riemann integral is too small correspondingly. There is a neat theorem that shows that the actual value of the area for one of these regions is bounded by the left and right Riemann integrals. Therefore, the true area under the curve is bounded by the two estimates given. Obviously, because of the inaccuracies mentioned, this is not the best procedure for general use. The value of the left Riemann integral is 0.7794 and the value of the right Riemann integral is 0.6816 for this example, and such a discrepancy is disturbing. Intuitively, as the number of bars used in this process increases, the smaller the regions of curve that we are under- or overestimating. This suggests making the width of the bars (h) very small to improve accuracy.
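The over- and underestimation pattern can be reproduced with a few lines of code. This is a minimal sketch under stated assumptions: the function names are hypothetical, and only the downsloping first branch of p(θ) on [0:6) is used, so the numbers printed are not the 0.7794 and 0.6816 reported for the full function.

```python
def left_riemann(f, a, b, n):
    """h * sum of f(a + i*h) for i = 0, ..., n-1, with h = (b - a) / n."""
    h = (b - a) / n
    return h * sum(f(a + i * h) for i in range(n))

def right_riemann(f, a, b, n):
    """h * sum of f(a + i*h) for i = 1, ..., n, with h = (b - a) / n."""
    h = (b - a) / n
    return h * sum(f(a + i * h) for i in range(1, n + 1))

def downslope(theta):
    """First branch of p(theta): (6 - theta)^2 / 200 + 0.011 on [0, 6)."""
    return (6.0 - theta) ** 2 / 200.0 + 0.011

# Exact area of this branch over [0, 6]: 6**3 / 600 + 0.011 * 6 = 0.426
EXACT = 0.426

if __name__ == "__main__":
    for n in (6, 60, 600):
        print(n, left_riemann(downslope, 0, 6, n), right_riemann(downslope, 0, 6, n))
```

On this decreasing stretch the left sum stays above the exact area and the right sum stays below it, and the gap between them shrinks in proportion to the bar width.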

Such a procedure is always possible since the x-axis is the real number line, and we know that there are an inﬁnite number of places to set down the bars. It would be very annoying if every time we wanted to measure the area under a curve deﬁned by some function we had to create lots of these bars and sum them up. So now we can return to the idea of limit. As the number of bars increases over a bounded area, then necessarily the width of the bars decreases. So let the width of the bars go to zero in the limit, forcing an inﬁnite number of bars. It is not technically necessary, but continue to assume that all the bars are of equal size, so this limit result holds easily. We now need to be more formal about what we are doing. For a continuous function f (x) bounded by a and b, deﬁne the following limits for left and right Riemann integrals:
$$S_{\text{left}} = \lim_{h \to 0} h \sum_{i=0}^{n-1} f(a + ih), \qquad S_{\text{right}} = \lim_{h \to 0} h \sum_{i=1}^{n} f(a + ih),$$

where n is the number of bars, h is the width of the bars (and bins), and nh is required to cover the domain of the function, b − a. For every subregion the left and right Riemann integrals bound the truth, and these bounds necessarily get progressively tighter approaching the limit. So we then know that Sleft = Sright = R because of the effect of the limit. This is a wonderful result: The limit of the Riemann process is the true area under the curve. In fact, there is speciﬁc terminology for what we have done: The deﬁnite integral is given by
$$R = \int_a^b f(x)\,dx,$$

where the ∫ symbol is supposed to look somewhat like an "S" to remind us that this is really just a special kind of sum. The placement of a and b indicates the lower and upper limits of the definite integral, and f(x) is now called the integrand. The final piece, dx, is a reminder that we are summing over infinitesimal values of x. So while the notation of integration can be intimidating to the uninitiated, it really conveys a pretty straightforward idea. The integral here is called “definite” because the limits on the integration are defined (i.e., having specific values like a and b here). Note that this use of the word limit applies to the range of application for the integral, not a limit, in the sense of limiting functions studied in Section 5.2.

5.5.1.1 Application: Limits of a Riemann Integral

Suppose that we want to evaluate the function f(x) = x² over the domain [0:1] using this methodology. First divide the interval up into h slices each of width 1/h since our interval is 1 wide (note that h here counts the slices rather than measuring their width). Thus the region of interest is given by the limit of a right Riemann integral, since the index runs from 1 to h:
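$$R = \lim_{h \to \infty} \sum_{i=1}^{h} \frac{1}{h} f(i/h) = \lim_{h \to \infty} \sum_{i=1}^{h} \frac{1}{h} (i/h)^2 = \lim_{h \to \infty} \frac{1}{h^3} \sum_{i=1}^{h} i^2 = \lim_{h \to \infty} \frac{1}{h^3} \frac{h(h+1)(2h+1)}{6} = \lim_{h \to \infty} \frac{1}{6}\left(2 + \frac{3}{h} + \frac{1}{h^2}\right) = \frac{1}{3}.$$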

The step out of the summation was accomplished by a well-known trick. Here it is with a relative, stated generically:

$$\sum_{x=1}^{n} x^2 = \frac{n(n+1)(2n+1)}{6}, \qquad \sum_{x=1}^{n} x = \frac{n(n+1)}{2}.$$

This process is shown in Figure 5.6 using right Riemann sums for 10 and 100 bins over the interval to highlight the progress that is made in going to the limit. Summing up the bin heights and dividing by the number of bins produces 0.384967 for 10 bins and 0.3383167 for 100 bins. So already at 100 bins we are fitting the curve reasonably close to the true value of one-third.
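The convergence toward one-third can be sketched directly. The function names below are assumptions for illustration, and the code evaluates bars at exact multiples of the bin width, so its output need not agree with the figures quoted above to every decimal place.

```python
def riemann_x_squared(n, side="right"):
    """Riemann sum for f(x) = x**2 on [0, 1] using n equal-width bins."""
    h = 1.0 / n
    indices = range(1, n + 1) if side == "right" else range(n)
    return h * sum((i * h) ** 2 for i in indices)

def closed_form_right(n):
    """(1 / n**3) * n(n+1)(2n+1)/6: the summation identity applied to the right sum."""
    return n * (n + 1) * (2 * n + 1) / (6.0 * n ** 3)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, riemann_x_squared(n, "left"), riemann_x_squared(n, "right"))
```

Because x² is increasing on [0:1], the left sums approach one-third from below and the right sums approach it from above.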

5.6 The Fundamental Theorem of Calculus

We start this section with some new definitions. In the last section principles of Riemann integration were explained, and here we extend these ideas. Since both the left and the right Riemann integrals produce the correct area in the limit as the number of $h_i = (x_i - x_{i-1})$ goes to infinity, it is clear that some point in between the two will also lead to convergence. Actually, it is immaterial which point we pick in the closed interval, due to the effect of the limiting operation. For slices i = 1 to H covering the full domain of f(x), define the point $\hat{x}_i$ as an arbitrary point in the ith interval $[x_{i-1} : x_i]$. Therefore,

$$\int_a^b f(x)\,dx = \lim_{h \to 0} \sum_{i=1}^{H} f(\hat{x}_i) h_i,$$

and this is now called a Riemann sum as opposed to a Riemann integral. We need one more definition before proceeding. The process of taking a derivative has an opposite, the antiderivative. The antiderivative corresponding to a specific derivative takes the equation form back to its previous state. So, for example, if $f(x) = \frac{1}{3}x^3$ and the derivative is $f'(x) = x^2$, then the antiderivative of the function $g(x) = x^2$ is $G(x) = \frac{1}{3}x^3$. Usually antiderivatives are designated with a capital letter. Note that the derivative of the antiderivative returns the original function, so that for a continuous f(x) with antiderivative F(x),

$$\int_a^b f(x)\,dx = F(b) - F(a),$$

which simply says that integration and differentiation are opposite procedures: an integral of f (x) from a to b is just the antiderivative at b minus the antiderivative at a. This is really important theoretically, but it is also really important computationally because it shows that we can integrate functions by using antiderivatives rather than having to worry about the more laborious limit operations.
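The computational payoff can be seen by comparing the two routes on a simple polynomial. This is an illustrative sketch: the helper names are assumptions, and the midpoint version of the Riemann sum is used only as a convenient stand-in for the limiting process.

```python
def integrate_polynomial(coeffs, a, b):
    """Definite integral of sum(c_k * x**k) computed as F(b) - F(a).

    coeffs[k] is the coefficient on x**k; F is the term-by-term antiderivative.
    """
    def antiderivative(x):
        return sum(c * x ** (k + 1) / (k + 1) for k, c in enumerate(coeffs))
    return antiderivative(b) - antiderivative(a)

def midpoint_sum(f, a, b, n=10_000):
    """Riemann sum with each bar evaluated at the midpoint of its bin."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

if __name__ == "__main__":
    # x**2 on [0, 1]: one subtraction versus a 10,000-term sum
    print(integrate_polynomial([0, 0, 1], 0, 1))
    print(midpoint_sum(lambda x: x ** 2, 0, 1))
```

Both routes land on one-third, but the antiderivative needs only two function evaluations.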

5.6.1 Integrating Polynomials with Antiderivatives The use of antiderivatives for solving deﬁnite integrals is especially helpful with polynomial functions. For example, let us calculate the following deﬁnite integral:
2 1

the meaning is obvious from the dy term (the distinction is more important in the next chapter, when we study integrals of more than one variable). Now we will summarize the basic properties of deﬁnite integrals. Properties of Deﬁnite Integrals Constants Additive Property
b a

The first two properties are obvious by now and the third is just a combination of the first two. The fourth property is much more interesting. It says that we can split up the definite integral into two pieces based on some intermediate value between the endpoints and do the integration separately. Let us now do this with the function $f(x) = 2x^{-5/2} - x^{-9/2}$ integrated over [0.8 : 2.0] with an intermediate point at 1.25:

$$\begin{aligned}
\int_{0.8}^{2.0} \left(2x^{-5/2} - x^{-9/2}\right) dx &= \int_{0.8}^{1.25} \left(2x^{-5/2} - x^{-9/2}\right) dx + \int_{1.25}^{2.0} \left(2x^{-5/2} - x^{-9/2}\right) dx \\
&= \left[-\frac{4}{3}x^{-3/2} + \frac{2}{7}x^{-7/2}\right]_{0.8}^{1.25} + \left[-\frac{4}{3}x^{-3/2} + \frac{2}{7}x^{-7/2}\right]_{1.25}^{2.0} \\
&= [-0.82321 - (-1.23948)] + [-0.44615 - (-0.82321)] = 0.79333.
\end{aligned}$$

This is illustrated in Figure 5.7. This technique is especially handy where it is difficult to integrate the function in one piece (the example here is therefore somewhat artificial). Such cases occur when there are discontinuities or pieces of the area below the x-axis.

Example 5.9: The Median Voter Theorem. The simplest, most direct

analysis of the aggregation of vote preferences in elections is the Median Voter Theorem. Duncan Black’s (1958) early article identiﬁed the role of a speciﬁc voter whose position in a single issue dimension is at the median of other voters’ preferences. His theorem roughly states that if all of the voters’ preference distributions are unimodal, then the median voter will always be in the winning majority. This requires two primary restrictions. There must be a single issue dimension (unless the same person is the median voter in all relevant dimensions), and each voter must have a unimodal preference distribution. There are also two other assumptions generally of a less-important nature: All voters participate in the election, and all voters express their true preferences (sincere voting). There is a substantial literature that evaluates the median voter theorem after altering any of these assumptions [see Dion (1992), for example]. The Median Voter Theorem is displayed in Figure 5.8, which is a reproduction of Black’s ﬁgure (1958, p.15). Shown are the preference curves for ﬁve hypothetical voters on an interval measured issue space (the x-axis), where the utility goes to zero at two points for each voter (one can also assume that the utility curves asymptotically approach zero as Black did). In the case given here it is clear that the voter with the mode at O3 is the median voter for this system, and there is some overlap with the voter whose mode is at O2 . Since overlap represents some form of potential agreement, we might be interested in measuring this area. These utility functions are often drawn or assumed to be parabolic shapes. The general form used here is f (x) = 10 − (µi − x)2 ωi , where µi determines this voter’s modal value and ωi determines how fast their

utility diminishes moving away from the mode (i.e., how “fat” the curve is for this voter). For the two voters under study, the utility equations are therefore

V2 : f (x) = 10 − (3.5 − x)2 (6)

V3 : f (x) = 10 − (5 − x)2 (2.5),

so smaller values of ω produce more spread out utility curves. This is a case where we will need to integrate the area of overlap in two pieces, because the functions that deﬁne the area from above are different on either side of the intersection point. The ﬁrst problem encountered is that we do not have any of the integral limits: the points where the parabolas intersect the x-axis (although we only need two of the four from looking at the ﬁgure), and the point where the two parabolas intersect. To obtain the latter we will equate the two forms and solve with the quadratic equation from page 33 (striking out the 10’s and the multiplication by −1 here since they will cancel each other anyway). First
equate the two forms and collect terms, then apply the quadratic equation:

$$6(3.5 - x)^2 = 2.5(5 - x)^2 \quad \Longrightarrow \quad 3.5x^2 - 17x + 11 = 0,$$

$$x = \frac{17 \pm \sqrt{(-17)^2 - 4(3.5)(11)}}{2(3.5)} = 4.0884 \text{ or } 0.7687.$$

This is a quadratic form, so we get two possible answers. To find the one we want, plug both potential values of x into one of the two parabolic forms and observe the y values:

$$y = 10 - 2.5(5 - (4.0884))^2 = 7.9226$$
$$y = 10 - 2.5(5 - (0.7687))^2 = -34.7593.$$

Because we want the point of intersection that exists above the x-axis, the choice between the two x values is now obvious to make. To get the roots of the two parabolas (the points where y = 0), we can again apply the quadratic equation to the two parabolic forms (using the original form):

$$\text{Voter 2:} \quad x = \frac{-(42) \pm \sqrt{(42)^2 - 4(-6)(-63.5)}}{2(-6)} = 2.209 \text{ or } 4.79$$
$$\text{Voter 3:} \quad x = \frac{-(25) \pm \sqrt{(25)^2 - 4(-2.5)(-52.5)}}{2(-2.5)} = 3 \text{ or } 7.$$

We know that we want the greater root of the first parabola and the lesser root of the second parabola (look at the picture), so we will use 3 and 4.79 as limits on the integrals. The area to integrate now consists of the following two-part problem, solved by the antiderivative method:

$$\begin{aligned}
A &= \int_{3}^{4.0884} (-6x^2 + 42x - 63.5)\,dx + \int_{4.0884}^{4.79} (-2.5x^2 + 25x - 52.5)\,dx \\
&= \left[-2x^3 + 21x^2 - 63.5x\right]_{3}^{4.0884} + \left[-\frac{5}{6}x^3 + \frac{25}{2}x^2 - 52.5x\right]_{4.0884}^{4.79} \\
&= ((-45.27343) - (-55.5)) + ((-56.25895) - (-62.65137)) = 16.619.
\end{aligned}$$

So we now know the area of the overlapping region above the x-axis between voter 2 and voter 3. If we wanted to, we could calculate the other overlapping regions between voters and compare as a measure of utility similarity on the issue space.
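The arithmetic in this example can be retraced in code. The sketch below is an illustration only: the helper names are assumptions, and the limits are rounded to 4.0884 and 4.79 to match the rounding used in the text. It evaluates the two-piece integral as written, and for comparison it also evaluates the area lying under the lower of the two curves on each side of the crossing point, which readers may wish to set against Figure 5.8.

```python
import math

def quadratic_roots(a, b, c):
    """Real roots of a*x**2 + b*x + c = 0, smallest first."""
    disc = math.sqrt(b * b - 4 * a * c)
    return sorted([(-b - disc) / (2 * a), (-b + disc) / (2 * a)])

def F2(x):
    """Antiderivative of voter 2's utility, -6x^2 + 42x - 63.5."""
    return -2 * x ** 3 + 21 * x ** 2 - 63.5 * x

def F3(x):
    """Antiderivative of voter 3's utility, -2.5x^2 + 25x - 52.5."""
    return -(5 / 6) * x ** 3 + 12.5 * x ** 2 - 52.5 * x

# Crossing point: 6(3.5 - x)^2 = 2.5(5 - x)^2  =>  3.5x^2 - 17x + 11 = 0
x_cross = round(quadratic_roots(3.5, -17, 11)[1], 4)    # crossing above the x-axis
lower = round(quadratic_roots(-2.5, 25, -52.5)[0], 2)   # lesser root of voter 3
upper = round(quadratic_roots(-6, 42, -63.5)[1], 2)     # greater root of voter 2

# The two-piece integral with the integrands in the order used in the text
area_text = (F2(x_cross) - F2(lower)) + (F3(upper) - F3(x_cross))

# Comparison: area under the lower curve on each side of the crossing
area_lower_curve = (F3(x_cross) - F3(lower)) + (F2(upper) - F2(x_cross))

if __name__ == "__main__":
    print(x_cross, lower, upper)
    print(area_text, area_lower_curve)
```

The first quantity reproduces the 16.619 reported above; the second is smaller because it counts only the region that sits beneath both parabolas at once.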

5.6.2 Indefinite Integrals

Indefinite integrals are those that lack specific limits for the integration operation. The consequence of this is that there must be an arbitrary constant (labeled k here) added to the antiderivative to account for the constant component that would be removed by differentiating:

$$\int f(x)\,dx = F(x) + k.$$

That is, if F(x) + k is the antiderivative of f(x) and we were to calculate $\frac{d}{dx}(F(x) + k)$ with defined limits, then any value for k would disappear. The logic and utility of indefinite integrals is that we use them to relate functions rather than to measure specific areas in coordinate space, and the further study of this is called differential equations.

5.6.3 Integrals Involving Logarithms and Exponents

We have already seen that the derivatives of exponentials and logarithms are special cases: $\frac{d}{dx}e^x = e^x$ and $\frac{d}{dx}\log(x) = \frac{1}{x}$. For the most part, these are important rules to memorize, particularly in statistical work. The integration process with logarithms and exponents is only slightly more involved. Recall that the chain rule applied to the exponential function takes the form

$$\frac{d}{dx}e^u = e^u \frac{du}{dx}.$$

This means that the form of the u function remains in the exponent but its derivative comes down. So, for example,

$$\frac{d}{dx}e^{3x^2 - x} = e^{3x^2 - x}(6x - 1),$$

which is simple if one can remember the rule. For integration it is essential to keep track of the “reverse chain rule” that comes from this principle:

$$\int e^u\,du = e^u + k.$$

This means that the u function must be incorporated into the limit definition to reverse $\frac{du}{dx}$. In addition, we have to add a constant k that could have been there but was lost due to the derivative function operating on it ($\frac{d}{dx}(k) = 0$).

In the following example the function in the exponent is f(x) = −x, so we alter the limit multiplying by −1 so that the exponent value and limit are identical and the regular property of e applies:

$$\int_0^2 e^{-x}\,dx = -\int_0^2 e^{-x}\,d(-x) = -e^{-x}\Big|_0^2 = -e^{-2} - (-e^{0}) = 0.8646647.$$

Now consider a more complicated problem:

$$\int \frac{e^x}{1 + e^x}\,dx,$$

which seems very difficult until we make a substitution. First define $u = 1 + e^x$, which changes the integral to

$$\int \frac{e^x}{u}\,dx.$$

This does not seem to help us much until we notice that

$$\frac{d}{dx}(1 + e^x) = e^x = \frac{du}{dx},$$

meaning that we can make the following second substitution:

$$\int \frac{e^x}{1 + e^x}\,dx = \int \frac{du}{u}, \qquad \text{where } u = 1 + e^x \text{ and } du = \frac{d}{dx}(1 + e^x)\,dx = e^x\,dx.$$

So the seemingly difficult integral now has a very simple antiderivative (using the rule $\frac{d}{dy}\log(y) = 1/y$), which we can perform and then substitute back to the quantity of interest:

$$\int \frac{e^x}{1 + e^x}\,dx = \int \frac{du}{u} = \log(u) + k = \log(1 + e^x) + k.$$

What this demonstrates is that the rules governing exponents and logarithms for derivatives can be exploited when going in the opposite direction. When these sorts of substitutions are less obvious or the functions are more complicated, the next tool required is integration by parts.
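A substitution result such as log(1 + eˣ) + k can be sanity-checked by comparing a definite version of the integral against a numerical sum. The sketch below is illustrative: the function names and the interval [0:1] are assumptions, and the constant k cancels when the antiderivative is differenced.

```python
import math

def integrand(x):
    """e^x / (1 + e^x), the function being integrated."""
    return math.exp(x) / (1.0 + math.exp(x))

def antiderivative(x):
    """log(1 + e^x), the result of the u = 1 + e^x substitution (k omitted)."""
    return math.log(1.0 + math.exp(x))

def midpoint_sum(f, a, b, n=10_000):
    """Riemann sum with each bar evaluated at the midpoint of its bin."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

if __name__ == "__main__":
    print(midpoint_sum(integrand, 0.0, 1.0))
    print(antiderivative(1.0) - antiderivative(0.0))
```

Agreement between the two printed values is evidence that the substitution was carried out correctly.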

5.6.4 Integration by Parts

So far we have not developed a method for integrating functions that are products, although we did see that differentiating products is quite straightforward. Suppose we have an integral of the form

$$\int f(x)g(x)\,dx.$$

It is not always easy to see the structure of the antiderivative here. We will now derive a method, integration by parts, for unwinding the product rule. The trick is to recharacterize part of the function into the d() argument.

We will start with a basic derivation. Suppose first that we label f(x) = u and g(x) = v, and note the shorthand versions of the derivatives $\frac{d}{dx}u = u'$ and $\frac{d}{dx}v = v'$, so that $du = u'\,dx$ and $dv = v'\,dx$. The product rule gives $d(uv) = u\,dv + v\,du$, and integrating both sides produces

$$uv = \int u\,dv + \int v\,du.$$

By trivially rearranging this form we get the formula for integration by parts:

$$\int u\,dv = uv - \int v\,du.$$

This means that if we can rearrange the integral as a product of the function u and the derivative of another function dv, we can get uv, which is the product of u and the integral of dv minus a new integral, which will hopefully be easier to handle. If the latter integral requires it, we can repeat the process with new terms for u and v. We also need to readily obtain the integral of dv to get uv, so it is possible to choose terms that do not help. As the last discussion probably foretells, there is some “art” associated with integration by parts. Specifically, how we split up the function to be integrated must be done strategically so that we get simpler constituent parts on the right-hand side. Here is the classic textbook example:

$$\int x \log(x)\,dx,$$

which would be challenging without some procedure like the one described. The first objective is to see how we can split up x log(x)dx into udv. The two possibilities are

$$[u][dv] = [x][\log(x)\,dx]$$
$$[u][dv] = [\log(x)][x\,dx],$$

where the choice is clear since we cannot readily obtain $v = \int dv = \int \log(x)\,dx$. So picking the second arrangement gives the full mapping:

$$\begin{array}{ll} u = \log(x) & dv = x\,dx \\ du = \frac{1}{x}\,dx & v = \frac{1}{2}x^2. \end{array}$$

This physical arrangement in the box is not accidental; it helps to organize the constituent pieces and their relationships. The top row multiplied together should give the integrand. The second row is the derivative and the antiderivative of each of the corresponding components above. We now have all of the pieces mapped out for the integration by parts procedure:

$$\begin{aligned}
\int u\,dv &= uv - \int v\,du \\
&= (\log(x))\left(\frac{1}{2}x^2\right) - \int \left(\frac{1}{2}x^2\right)\left(\frac{1}{x}\,dx\right) \\
&= \frac{1}{2}x^2\log(x) - \int \frac{1}{2}x\,dx \\
&= \frac{1}{2}x^2\log(x) - \frac{1}{2}\left(\frac{1}{2}x^2\right) + k \\
&= \frac{1}{2}x^2\log(x) - \frac{1}{4}x^2 + k.
\end{aligned}$$

We benefited from a very simple integral in the second stage, because the antiderivative of $\frac{1}{2}x$ is straightforward. It can be the case that this integral is more difficult than the original one, which means that the choice of function assignment needs to be rethought.
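An integration-by-parts result can be verified in two ways: differentiate the answer and confirm that the integrand comes back, or compare a definite version against a numerical sum. The sketch below does both; the function names, the interval [1:2], and the step sizes are assumptions chosen for illustration.

```python
import math

def integrand(x):
    """x * log(x), the function being integrated."""
    return x * math.log(x)

def by_parts_result(x):
    """(1/2) x^2 log(x) - (1/4) x^2, the integration-by-parts answer (k omitted)."""
    return 0.5 * x ** 2 * math.log(x) - 0.25 * x ** 2

def central_difference(f, x, eps=1e-6):
    """Numerical derivative of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def midpoint_sum(f, a, b, n=10_000):
    """Riemann sum with each bar evaluated at the midpoint of its bin."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

if __name__ == "__main__":
    print(central_difference(by_parts_result, 1.5), integrand(1.5))
    print(midpoint_sum(integrand, 1.0, 2.0), by_parts_result(2.0) - by_parts_result(1.0))
```

Each printed pair should agree to several decimal places if the split into u and dv was handled correctly.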

5.6.4.1 Application: The Gamma Function

The gamma function (also called Euler’s integral) is given by

$$\Gamma(\omega) = \int_0^\infty t^{\omega - 1} e^{-t}\,dt, \qquad \omega > 0.$$

Here t is a “dummy” variable because it integrates away (it is a placeholder for the limits). The gamma function is a generalization of the factorial function that can be applied to any positive real number, not just integers. For integer values, though, there is the simple relation: Γ(n) = (n − 1)!. Since the result of the gamma function for any given value of ω is finite, the gamma function shows that finite results can come from integrals with limit values that include infinity. Suppose we wanted to integrate the gamma function for a known value of ω, say 3. The resulting integral to calculate is

$$\int_0^\infty t^2 e^{-t}\,dt.$$

There are two obvious ways to split the integrand into u and dv. Consider this one first:

$$\begin{array}{ll} u = e^{-t} & dv = t^2\,dt \\ du = -e^{-t}\,dt & v = \frac{1}{3}t^3. \end{array}$$

The problem here is that we are moving up ladders of the exponent of t, thus with each successive iteration of integration by parts we are actually making the subsequent $\int v\,du$ integral more difficult. So this will not do. The other logical split is

$$\begin{array}{ll} u = t^2 & dv = e^{-t}\,dt \\ du = 2t\,dt & v = -e^{-t}. \end{array}$$

So we proceed with the integration by parts (omitting the limits on the integral for the moment):

$$\begin{aligned}
\Gamma(3) &= uv - \int v\,du \\
&= (t^2)(-e^{-t}) - \int (-e^{-t})(2t)\,dt \\
&= -e^{-t}t^2 + 2\int e^{-t}t\,dt.
\end{aligned}$$

Of course we now need to repeat the process to calculate the new integral on the right-hand side, so we will split this new integrand ($e^{-t}t$) up in a similar fashion:

$$\begin{array}{ll} u = t & dv = e^{-t}\,dt \\ du = dt & v = -e^{-t}. \end{array}$$
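Since Γ(n) = (n − 1)!, the integral for ω = 3 should equal 2, and this can be confirmed numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the infinite upper limit is truncated at 60, beyond which the integrand is negligible for small ω.

```python
import math

def gamma_integrand(t, omega):
    """t^(omega - 1) * e^(-t), the integrand of the gamma function."""
    return t ** (omega - 1) * math.exp(-t)

def gamma_numeric(omega, upper=60.0, n=100_000):
    """Midpoint-rule approximation with the upper limit truncated at `upper`."""
    h = upper / n
    return h * sum(gamma_integrand((i + 0.5) * h, omega) for i in range(n))

if __name__ == "__main__":
    print(gamma_numeric(3))   # numerical integral
    print(math.gamma(3))      # standard-library gamma function
    print(math.factorial(2))  # (n - 1)! for n = 3
```

The three printed values agree, illustrating that a finite area results even though the upper limit of integration is infinite.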

which still includes her subjective estimate of $x_0$, and we have plugged in the linear function u(x) = x/v in the second part. Actually, it might be more appropriate to calculate this with a sum since v is discrete, but with a large value it will not make a substantial difference and the sum would be much harder. This formulation means that the rate of change in utility for a change in $x_0$ (she votes) is

$$\frac{\partial EV}{\partial x_0} = -\frac{1}{v}\int_0^v x\,\frac{\partial}{\partial x_0} g(x - x_0)\,dx,$$

which requires some technical “regularity” conditions to let the derivative pass inside the integral. To solve this integral, integration by parts is necessary

∂EV 1 = − (vg(x − x0 )dx − 1). ∂x0 v This means that as g(x−x0) goes to zero (the expectation of actually affecting the election) voting utility simpliﬁes to 1/v, which returns us to the paradox of participation that resulted from the model on page 5. After developing this argument, Riker and Ordeshook saw the result as a refutation of the linear utility assumption for elections because a utility of 1/v fails to account for the reasonable number of people that show up at the polls in large elections.

5.7 Additional Topics: Calculus of Trigonometric Functions

This section contains a set of trigonometry topics that are less frequently used in the social sciences but may be useful as references. As before, readers may elect to skip this section.

5.7.1 Derivatives of Trigonometric Functions

The trigonometric functions do not provide particularly intuitive derivative forms, but fortunately the two main results are incredibly easy to remember. The derivative forms for the sine and cosine function are

$$\frac{d}{dx}\sin(x) = \cos(x) \qquad \frac{d}{dx}\cos(x) = -\sin(x).$$

So the only difficult part to recall is that there is a change of sign on the derivative of the cosine. Usually, this operation needs to be combined with the chain rule, because it is more common to have a compound function, u = g(x), rather than just x. The same rules incorporating the chain rule are given by

$$\frac{d}{dx}\sin(u) = \cos(u)\frac{d}{dx}u \qquad \frac{d}{dx}\cos(u) = -\sin(u)\frac{d}{dx}u.$$
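The chain-rule versions can be checked with a numerical derivative. In this sketch the inner function u = 3x² − x, the evaluation point, and the helper names are all assumptions chosen for illustration.

```python
import math

def u(x):
    """Hypothetical inner function u = g(x) = 3x^2 - x."""
    return 3 * x ** 2 - x

def du_dx(x):
    """Derivative of the inner function: 6x - 1."""
    return 6 * x - 1

def central_difference(f, x, eps=1e-6):
    """Numerical derivative of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def d_sin_u(x):
    """Chain rule: d/dx sin(u) = cos(u) * du/dx."""
    return math.cos(u(x)) * du_dx(x)

def d_cos_u(x):
    """Chain rule: d/dx cos(u) = -sin(u) * du/dx."""
    return -math.sin(u(x)) * du_dx(x)

if __name__ == "__main__":
    x = 0.7
    print(central_difference(lambda z: math.sin(u(z)), x), d_sin_u(x))
    print(central_difference(lambda z: math.cos(u(z)), x), d_cos_u(x))
```

Each numerical derivative should match its chain-rule counterpart closely, including the sign change on the cosine.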

5.7.2 Integrals of Trigonometric Functions

The integrals of the basic trigonometric functions are like the derivative forms: easy to understand, annoying to memorize, and simple to look up (don’t sell this book!). They can be given as either definite or indefinite integrals. The two primary forms are

$$\int \sin(x)\,dx = -\cos(x) + k \qquad \int \cos(x)\,dx = \sin(x) + k.$$

It is also important to be able to manipulate these integrals for the reverse chain rule operation, for instance, sin(u)du = − cos(u) + k. The other four basic

Comparing each suburb size separately, which of these two forms implies the greatest instantaneous change in y at x = 0.5? What is the interpretation on the minus sign for each coefﬁcient on x and a positive coefﬁcient for each coefﬁcient on x2 ?

Stephan and McMullin (1981) considered transportation issues and time minimization as a determinant of the distribution of county seats in the United States and elsewhere. The key trade-off is this: If territories are too small, then there may be insufficient economic resources to sustain necessary services, and if territories are too large, then travel distances swamp economic advantages from scale. Define s as the average distance to traverse, v as the average speed, h as the total maintenance time required (paid for by the population), and p as the population size. The model for time proposed is T = s/v + h/p. Distance is proportional to area, so substitute in $s = g\sqrt{a}$ and p = ad, where g is a proportionality constant and a is area. Now find the condition for a that minimizes time by taking the first derivative of T with respect to a, setting it equal to zero, and solving. Show that this is a minimum by taking an additional (second) derivative with respect to a and noting that it must be positive.

Blackwell and Girshick (1954) derived the result below in the context of mixed strategies in game theory. Game theory is a tool in the social sciences where the motivations, strategies, and rewards to hypothesized competing actors are analyzed mathematically to make predictions or explain observed behavior. This idea originated formally with von Neumann and Morgenstern’s 1944 book. Simplified, an actor employs a mixed strategy when she has a set of alternative actions each with a known or assumed probability of success, and the choice of actions is made by randomly selecting one with that associated probability. Thus if there are three alternatives with success probabilities $\frac{1}{2}, \frac{1}{3}, \frac{1}{6}$, then there is a 50% chance of picking the first, and so on. Blackwell and Girshick (p.54) extended this idea to continuously measured alternatives between zero and one (i.e., a smooth function rather than a discrete listing). The first player accordingly chooses value x ∈ [0:1], and the second player chooses value y ∈ [0:1], and the function that defines the “game” is given by $M(x, y) = f(x - y)$, where
$$f(t) = \begin{cases} t(1 - t), & \text{for } 0 \le t \le 1 \\ f(t + 1), & \text{for } -1 \le t \le 0. \end{cases}$$
In other words, it matters which of x and y is larger in this game. Here is the surprising part. For any fixed value of y (call this $y_0$), the expected value of the game to the first player is $\frac{1}{6}$. To show this we integrate over the range of alternatives available to this player:
$$\int_0^1 M(x, y)\,dx = \int_0^1 f(x - y_0)\,dx = \underbrace{\int_0^{y_0} f(x - y_0)\,dx}_{x < y_0} + \underbrace{\int_{y_0}^1 f(x - y_0)\,dx}_{x > y_0},$$
where breaking the integral into two pieces is necessary because the first one contains the case where $-1 \le t \le 0$ and the second one contains the case where $0 \le t \le 1$. Substitute in the two function values (t or t + 1) and integrate over x to obtain exactly $\frac{1}{6}$.
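The claim that the value does not depend on $y_0$ can be checked numerically before doing the algebra. The following is only a sketch in Python: the function names and the midpoint-rule integrator are illustrative choices, not part of the exercise.

```python
def f(t):
    """Payoff kernel: t(1 - t) on [0, 1], extended to [-1, 0) by f(t) = f(t + 1)."""
    if t < 0:
        t += 1.0
    return t * (1.0 - t)


def game_value(y0, n=20_000):
    """Midpoint-rule approximation of the integral of f(x - y0) for x in [0, 1]."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h - y0) for i in range(n))


# Several (arbitrarily chosen) fixed strategies for the second player.
for y0 in (0.0, 0.25, 0.5, 0.9):
    print(y0, game_value(y0))
```

Each printed value should agree with 1/6 to several decimal places, whatever $y_0$ is used.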

Show that Rolle’s Theorem is a special case of the Mean Value Theorem by generalizing the starting and stopping points.

Calculate $f(x) = x^3$ over the domain [0 : 1] using limits only (no power rule), as was done for $f(x) = x^2$ on page 208.

From the appendix to Krehbiel (2000), take the partial derivative of
$$\frac{50\left((M^2 - M/2 + \delta)^2 + 100(1 - M/2 + \delta) - M\right)}{M\left((1 - M/2 + \delta) - (M/2 + \delta)\right)}$$
with respect to M and show that it is decreasing in M ∈ (0:1) (i.e., as M increases) for all values δ ∈ (0:1).


6
Additional Topics in Scalar and Vector Calculus

6.1 Objectives

This chapter presents additional topics in basic calculus that go beyond the introductory material in Chapter 5. These topics include more advanced uses of derivatives like partial derivatives and higher order partial derivatives; root finding (locating an important point along some function of interest); analyzing function minima, maxima, and inflection points (points where derivatives change); integrals on functions of multiple variables; and, finally, the idea of an abstract series. In general, the material extends that of the last chapter and demonstrates some applications in the social sciences. A key distinction made in this chapter is the manner in which functions of multiple variables are handled with different operations.

6.2 Partial Derivatives

A partial derivative is a regular derivative, just as we have already studied, except that the operation is performed on a function of multiple variables where the derivative is taken only with respect to one of them and the others are treated as constants.

Notice that the variables that are not part of the differentiation process are simply treated as constants in these operations. We can also evaluate a more complex function with additional variables:
$$f(u_1, u_2, u_3, u_4, u_5) = u_1^{u_2 u_3} \sin\left(\frac{u_1}{2}\right) \log\left(\frac{u_4}{u_5}\right) u_3^2,$$

Schultz (1970) analyzed the relationship between workers’ ages and their earnings with the idea that earnings increase early in a worker’s career but tend to decrease toward retirement. Thus the relationship is parabolic in nature according to this theory. Looking at maintenance electricians, Schultz posited an additive relationship that affects income (in units of 1,000) according to
$$\text{Income} = \beta_0 + \beta_1(\text{Seniority}) + \beta_2(\text{School.Years}) + \beta_3(\text{Experience}) + \beta_4(\text{Training}) + \beta_5(\text{Commute.Distance}) + \beta_6(\text{Age}) + \beta_7(\text{Age}^2),$$
where the β values are scalar values that indicate how much each factor individually affects Income (produced by linear regression, which is not critical to our discussion here). Since Age and Age² are both included in the analysis, the effect of the workers’ age is parabolic, which is exactly as the author intended: rising early, cresting, and then falling back. If we take the first partial derivative with respect to age, we get
$$\frac{\partial}{\partial \text{Age}} \text{Income} = \beta_6 + 2\beta_7 \text{Age},$$
where $\beta_6 = 0.031$ and $\beta_7 = -0.00032$ (notice that all the other causal factors disappear due to the additive specification). Thus at age 25 the incremental effect of one additional year is 0.031 + 2(−0.00032)(25) = 0.015, at age 50 it is 0.031 + 2(−0.00032)(50) = −0.001, and at 75 it is 0.031 + 2(−0.00032)(75) = −0.017.

Example 6.2: Indexing Socio-Economic Status (SES). An early measure of socio-economic status by Gordon (1969) reevaluated conventional compilation of various social indicators as a means of measuring “the position of the individual in some status ordering as determined by the individual’s characteristics–his education, income, position, in the community, the market place, etc.” The basic idea is to combine separately collected variables into a single measure because any one measure alone does not provide an overall assessment of status. The means by which these indicators are combined can vary considerably: additive or multiplicative, scaled or unscaled, weighted according to some criteria, and so on. Gordon proposed an alternative multiplicative causal expression for an individual’s socio-economic status of the form
$$SES = A E^b I^c M^d,$$
where
A = all terms not explicitly included in the model
E = years of education
I = amount of income
M = percent of time employed annually
and b, c, d are the associated “elasticities” providing a measure of strength for each causal term. The marginal (i.e., incremental) impacts on SES for each of the three terms of interest are given by partial derivatives with respect to the term of interest:
$$\frac{\partial SES}{\partial E} = A b E^{b-1} I^c M^d, \qquad \frac{\partial SES}{\partial I} = A c E^b I^{c-1} M^d, \qquad \frac{\partial SES}{\partial M} = A d E^b I^c M^{d-1}.$$
The point is that individual effects can be pulled from the general multiplicative specification with the partial derivatives. Note that due to the multiplicative nature of the proposed model, the marginal impacts are still dependent on levels of the other variables in a way that a strictly additive model would not produce: $SES = A + E^b + I^c + M^d$.
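The two sets of marginal effects above can be reproduced with a few lines of Python. This is a sketch only: the age coefficients are the ones quoted for the earnings model, but the values of A, b, c, d and the inputs E, I, M below are made-up numbers chosen purely for illustration, since the text gives none.

```python
def age_effect(age, beta6=0.031, beta7=-0.00032):
    """Partial derivative of Income with respect to Age: beta6 + 2*beta7*Age."""
    return beta6 + 2.0 * beta7 * age


def ses(E, I, M, A=1.0, b=0.5, c=0.3, d=0.2):
    """Multiplicative index SES = A * E^b * I^c * M^d (parameter values are hypothetical)."""
    return A * E**b * I**c * M**d


def ses_marginals(E, I, M, A=1.0, b=0.5, c=0.3, d=0.2):
    """Analytic partial derivatives of SES with respect to E, I, and M."""
    return (
        A * b * E**(b - 1) * I**c * M**d,
        A * c * E**b * I**(c - 1) * M**d,
        A * d * E**b * I**c * M**(d - 1),
    )


for age in (25, 50, 75):
    print(age, round(age_effect(age), 3))

# Central finite difference in E as a sanity check on the first analytic partial.
E, I, M, h = 12.0, 30.0, 0.8, 1e-6
numeric_dE = (ses(E + h, I, M) - ses(E - h, I, M)) / (2 * h)
print(numeric_dE, ses_marginals(E, I, M)[0])
```

Changing I or M in the last lines changes the marginal effect of E, which is the dependence on the other variables that the multiplicative form implies.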

6.3 Derivatives and Partial Derivatives of Higher Order

Derivatives of higher order than one (what we have been doing so far) are simply iterated applications of the derivative process. The second derivative of a function with respect to x is
$$\frac{\partial^2 f(x)}{\partial x^2} = \frac{\partial}{\partial x}\left(\frac{\partial}{\partial x} f(x)\right),$$

Note the convention with respect to the order designation here, and that it differs by placement in the numerator ($\partial^5$) and denominator ($\partial x^5$). Of course at some point a function will cease to support new forms of the higher order derivatives when the degree of the polynomial is exhausted. For instance, given the function $f(x) = 3x^3$,
$$\frac{\partial f(x)}{\partial x} = 9x^2, \quad \frac{\partial^2 f(x)}{\partial x^2} = 18x, \quad \frac{\partial^3 f(x)}{\partial x^3} = 18, \quad \frac{\partial^4 f(x)}{\partial x^4} = 0, \quad \frac{\partial^5 f(x)}{\partial x^5} = 0, \quad \ldots$$

Thus we “run out” of derivatives eventually, and all derivatives of order four or higher remain zero. So how do we interpret higher order derivatives? Because the first derivative is the rate of change of the original function, the second derivative is the rate of change for the rate of that change, and so on. Consider the simple example of the velocity of a car. The first derivative describes the rate of change of this velocity: very high when first starting out from a traffic light, and very low when cruising on the highway. The second derivative describes how fast this rate of change is changing. Again, the second derivative is very high early in the car’s path as it increases speed rapidly but is low under normal cruising conditions. Third-order and higher derivatives work in exactly this same way on respective lower orders, but the interpretation is often less straightforward. Higher order differentiation can also be applied to partial derivatives. Given the function $f(x, y) = 3x^3 y^2$, we can calculate
$$\frac{\partial^2}{\partial x \partial y} f(x, y) = \frac{\partial}{\partial y}(9x^2 y^2) = 18x^2 y$$
and
$$\frac{\partial^3}{\partial x^2 \partial y} f(x, y) = \frac{\partial^2}{\partial x \partial y}(9x^2 y^2) = \frac{\partial}{\partial y}(18xy^2) = 36xy.$$


Thus the order hierarchy in the denominator gives the “recipe” for how many derivatives of each variable to perform, but the sequence of operations does not change the answer and therefore should be done in an order that makes the problem as easy as possible. Obviously there are other higher order partial derivatives that could be calculated for this function. In fact, if a function has k variables each of degree n, the number of derivatives of order n is given by
$$\binom{n + k - 1}{n}.$$
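One way to see where this count comes from is to enumerate the possibilities directly. The sketch below assumes, as the text does, that the order of differentiation does not matter, so a derivative of order n is just an unordered selection of n variables with repetition allowed; the variable labels are illustrative.

```python
from itertools import combinations_with_replacement
from math import comb


def distinct_partials(variables, order):
    """All unordered ways to differentiate `order` times with respect to `variables`."""
    return list(combinations_with_replacement(variables, order))


# Third-order partial derivatives of a function of two variables, x and y.
partials = distinct_partials(("x", "y"), 3)
for p in partials:
    print("d^3 f / " + " ".join("d" + v for v in p))

n, k = 3, 2
print(len(partials), comb(n + k - 1, n))  # the two counts should match
```

For the two-variable function above this lists the four third-order derivatives (with respect to xxx, xxy, xyy, and yyy), in agreement with the binomial coefficient.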

6.4 Maxima, Minima, and Root Finding

Derivatives can be used to find points of interest along a given function. One point of interest is the point where the “curvature” of the function changes, given by the following definition:

• [Inflection Point.] For a given function, $y = f(x)$, a point $(x^*, y^*)$ is called an inflection point if the second derivative immediately on one side of the point is signed oppositely to the second derivative immediately on the other side.

So if $x^*$ is indeed an inflection point, then for some small interval around $x^*$, $[x^* - \delta, x^* + \delta]$, $f''(x)$ is positive on one side and negative on the other. Interestingly, if $f''(x)$ is continuous at the inflection point, then $f''(x^*) = 0$. This makes intuitive sense since on one side $f'(x)$ is increasing so that $f''(x)$ must be positive, and on the other side $f'(x)$ is decreasing so that $f''(x)$ must be negative. Making δ arbitrarily small and decreasing it toward zero, $x^*$ is the point where δ vanishes and the second derivative is neither positive nor negative and therefore must be zero. Graphically, the tangent line at the inflection point crosses the function such that it is on one side before the point and on the other side afterward. This is a consequence of the change of sign of the second derivative of the function and is illustrated in Figure 6.1 with the function $f(x) = (x^3 - 15x^2 + 60x + 30)/15$. The function (characteristically) curves away from the tangent line on one side and curves away in the opposite direction on the other side. In the first panel of this figure the function itself is plotted with the tangent line shown. In the second panel the first derivative function, $f'(x) = (x^2 - 10x + 20)/5$, is plotted with the tangent line shown at the corresponding minimum. Finally the third panel shows the second derivative function, $f''(x) = (2x - 10)/5$, with a horizontal line where the function crosses zero on the y-axis. Note that having a zero second derivative is a necessary but not sufficient condition for being an inflection point: the linear function $f''(x)$, for example, itself has a zero second derivative at $x = 5$, but that second derivative does not change sign just around 5.

[Fig. 6.1. Illustrating the Inflection Point. Three panels: $f(x) = (x^3 - 15x^2 + 60x + 30)/15$ with the point (5, 5.3333) marked; $f'(x) = (x^2 - 10x + 20)/5$ with (5, −1) marked; $f''(x) = (2x - 10)/5$ with (5, 0) marked.]
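The behavior described for Figure 6.1 can be confirmed numerically. This is a minimal Python sketch using the three functions given in the text; the step size `delta` and the comparison points 4 and 6 are arbitrary illustrative choices.

```python
def f(x):
    return (x**3 - 15 * x**2 + 60 * x + 30) / 15


def f1(x):
    """First derivative of f."""
    return (x**2 - 10 * x + 20) / 5


def f2(x):
    """Second derivative of f."""
    return (2 * x - 10) / 5


x_star = 5.0
delta = 0.01

# The second derivative is zero at x* and has opposite signs on either side.
changes_sign = f2(x_star - delta) * f2(x_star + delta) < 0
print(f(x_star), f2(x_star), changes_sign)


def tangent(x):
    """Tangent line to f at the candidate inflection point."""
    return f(x_star) + f1(x_star) * (x - x_star)


# The function lies below its tangent on one side and above it on the other.
gap_left = f(4.0) - tangent(4.0)
gap_right = f(6.0) - tangent(6.0)
print(gap_left, gap_right)
```

The opposite signs of the two gaps are the numerical counterpart of the tangent line crossing the curve at the inflection point.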


Example 6.3: Power Cycle Theory. Sometimes inflection points can be substantively helpful in analyzing politics. Doran (1989) looked at critical changes in relative power between nations, evaluating power cycle theory, which asserts that war is caused by changes in the gap between state power and state interest. The key is that when dramatic differences emerge between relative power (capability and prestige) and systematic foreign policy role (current interests developed or allowed by the international system), existing balances of power are disturbed. Doran particularly highlighted the case where a nation ascendant in both power and interest, with power in excess of interest, undergoes a “sudden violation of the prior trend” in the form of an inflection point in its power curve.
[Fig. 6.2. Power/Role Gap Changes Induced by the Inflection Point. Power curve plotted over times t0 through t9, with π1, π2, and the plausible range of role labeled.]

This is illustrated in Figure 6.2, where the inflection point at time $t_5$ in the power curve introduces uncertainty in the anticipated decline in increasing power (i.e., the difference in $\pi_1$ and $\pi_2$ in the figure) and the subsequent potential differences between power and the range of plausible roles. Thus the problem introduced by the inflection point is that it creates changes in the slope difference and thus changes in the gap between role and power that the state and international system must account for.

The function $f(x) = (x^3 - 15x^2 + 60x + 30)/15$ in Figure 6.1 has a single inflection point in the illustrated range because there is one point where the concave portion of the function meets the convex portion. If a function (or portion of a function) is convex (also called concave upward), then every possible chord is above the function, except for its endpoints. A chord is just a line segment connecting two points along a curve. If a function (or portion of a function) is concave (or concave downward), then every possible chord is below the function, except for its endpoints. Figure 6.3 shows chords over the concave portion of the example function, followed by chords over the convex portion.

[Fig. 6.3. Concave and Convex Portions of a Function. The function $f(x) = (x^3 - 15x^2 + 60x + 30)/15$ with chords drawn over its concave portion and over its convex portion.]

There is actually a more formal determination of concave and convex forms. If a function f(x) is twice differentiable and concave over some open interval, then $f''(x) \le 0$, and if a function f(x) is twice differentiable and convex over some open interval, then $f''(x) \ge 0$. The reverse statement is also true: A twice differentiable function that has a nonpositive second derivative over some interval is concave over that interval, and a twice differentiable function that has a nonnegative second derivative over some interval is convex over that interval. So the sign of the second derivative gives a handy test, which will apply in the next section.
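The chord definition and the second-derivative test can be compared directly on the example function from Figure 6.1. The sketch below is illustrative only: the intervals [0, 5] and [5, 10] and the grid of weights are choices made here, not taken from the text.

```python
def f(x):
    return (x**3 - 15 * x**2 + 60 * x + 30) / 15


def f2(x):
    """Second derivative of f."""
    return (2 * x - 10) / 5


def chord_gap(a, b, w):
    """Height of the chord from (a, f(a)) to (b, f(b)) minus f, at x = w*a + (1-w)*b."""
    x = w * a + (1 - w) * b
    return (w * f(a) + (1 - w) * f(b)) - f(x)


weights = [i / 10 for i in range(1, 10)]  # interior points of each chord

concave_gaps = [chord_gap(0.0, 5.0, w) for w in weights]   # where f'' <= 0
convex_gaps = [chord_gap(5.0, 10.0, w) for w in weights]   # where f'' >= 0

print(max(concave_gaps), f2(2.5))  # chord below the function, f'' negative
print(min(convex_gaps), f2(7.5))   # chord above the function, f'' positive
```

A negative gap means the chord sits below the curve (concave), and a positive gap means it sits above (convex), matching the sign of the second derivative on each interval.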

6.4.1 Evaluating Zero-Derivative Points

Repeated application of the derivative also gives us another general test: the second derivative test. We saw in Chapter 5 (see, for instance, Figure 5.4 on page 186) that points where the first derivatives are equal to zero are either maxima or minima of the function (modes or anti-modes), but without graphing how would we know which? Visually it is clear that the rate of change of the first derivative declines, moving away from a relative maximum point in both directions, and increases, moving away from a relative minimum point. The term “relative” here reinforces that these functions may have multiple maxima and minima, and we mean the behavior in a small neighborhood around the point of interest. This rate of change observation of the first derivative means that we can test with the second derivative: If the second derivative is negative at a point where the first derivative is zero, then the point is a relative maximum, and if the second derivative is positive at such a point, then the point is a relative minimum. Higher order polynomials often have more than one mode. So, for example, we can evaluate the function
$$f(x) = \frac{1}{4}x^4 - 2x^3 + \frac{11}{2}x^2 - 6x + \frac{11}{4}.$$
The first derivative is
$$f'(x) = x^3 - 6x^2 + 11x - 6,$$
which can be factored as
$$f'(x) = (x^2 - 3x + 2)(x - 3) = (x - 1)(x - 2)(x - 3).$$
Since the critical values are obtained by setting this first derivative function equal to zero and solving, it is apparent that they are simply 1, 2, and 3. Substituting these three x values into the original function shows that at the three points (1, 0.5), (2, 0.75), (3, 0.5) there is a function maximum or minimum. To determine whether each of these is a maximum or a minimum, we must first obtain the second derivative form,
$$f''(x) = 3x^2 - 12x + 11,$$
and then plug in the three critical values:
$$f''(1) = 3(1)^2 - 12(1) + 11 = 2$$
$$f''(2) = 3(2)^2 - 12(2) + 11 = -1$$
$$f''(3) = 3(3)^2 - 12(3) + 11 = 2.$$

This means that (1, 0.5) and (3, 0.5) are minima and (2, 0.75) is a maximum. We can think of this in terms of Rolle’s Theorem from the last chapter. If we modified the function slightly by subtracting $\frac{1}{2}$ (i.e., add $\frac{9}{4}$ instead of $\frac{11}{4}$), then two minima would occur on the x-axis. This change does not alter the fundamental form of the function or its maxima and minima; it just shifts the function up or down on the y-axis. By Rolle’s Theorem there is guaranteed to be another point in between where the first derivative is equal to zero; the point at x = 2 here is the only one from the factorization. Since the points at x = 1 and x = 3 are minima, then the function increases away from them, which means that x = 2 has to be a maximum.
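The second derivative test for this quartic is easy to mechanize. The following Python sketch hard-codes the function and second derivative from the text; the helper name `classify` and the "inconclusive" label for a zero second derivative are illustrative additions.

```python
def f(x):
    return 0.25 * x**4 - 2 * x**3 + 5.5 * x**2 - 6 * x + 2.75


def f2(x):
    """Second derivative: 3x^2 - 12x + 11."""
    return 3 * x**2 - 12 * x + 11


def classify(x):
    """Second derivative test at a point where the first derivative is zero."""
    curvature = f2(x)
    if curvature > 0:
        return "minimum"
    if curvature < 0:
        return "maximum"
    return "inconclusive"


# Critical values read off the factorization (x - 1)(x - 2)(x - 3).
for x in (1, 2, 3):
    print(x, f(x), f2(x), classify(x))
```

The loop recovers the three points and their classification as two minima on either side of a maximum.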

6.4.2 Root Finding with Newton-Raphson

A root of a function is the point where the function crosses the x-axis: f(x) = 0. This value is a “root” of the function f() in that it provides a solution to the polynomial expressed by the function, and it is where the curve crosses the x-axis in a graph of x versus f(x). A discussion of polynomial function roots with examples was given on page 33 in Chapter 1. Roots are typically substantively important points along the function, and it is therefore useful to be able to find them without much trouble. Previously, we were able to easily factor such functions as $f(x) = x^2 - 1$ to find the roots. However, this is not always realistically the case, so it is important to have a more general procedure. One such procedure is Newton’s method (also called Newton-Raphson). The general form of Newton’s method (to be derived) is a series of steps according to
$$x_1 \cong x_0 - \frac{f(x_0)}{f'(x_0)},$$
where we move from a starting point $x_0$ to $x_1$, which is closer to the root, using characteristics of the function itself. Newton’s method exploits the Taylor series expansion, which gives the relationship between the value of a mathematical function at the point $x_0$ and the function value at another point, $x_1$, given (with continuous derivatives over the relevant support) as
$$f(x_1) = f(x_0) + (x_1 - x_0)f'(x_0) + \frac{1}{2!}(x_1 - x_0)^2 f''(x_0) + \frac{1}{3!}(x_1 - x_0)^3 f'''(x_0) + \cdots,$$
where $f'$ is the first derivative with respect to x, $f''$ is the second derivative with respect to x, and so on. Infinite precision between the values $f(x_1)$ and $f(x_0)$ is achieved with the infinite extending of the series into higher order derivatives (of course the factorial component in the denominator means that these are rapidly decreasing increments). We are actually interested in finding a root of the function, which we will suppose exists at the undetermined point $x_1$. What we know so far is that for any point $x_0$ that we would pick, it is possible to relate $f(x_1)$ and $f(x_0)$ with the Taylor series expansion. This is simplified in two ways. First, note that if $x_1$ is a root, then $f(x_1) = 0$, meaning that the left-hand side of the expansion above is really zero. Second, while we cannot perfectly relate $f(x_0)$ to the function evaluated at the root because the expansion can never be fully evaluated, it should be obvious that using some of the first terms at least gets us closer to the desired point. Since the factorial function is increasing rapidly, let us use these last two facts and assert that
$$0 \cong f(x_0) + (x_1 - x_0)f'(x_0),$$
where the quality of the approximation can presumably be improved with better guesses of $x_0$. Rearrange the equation so that the quantity of interest is on the left-hand side:
$$x_1 \cong x_0 - \frac{f(x_0)}{f'(x_0)},$$
to make this useful for candidate values of $x_0$. This becomes “algorithmic” because $x_0$ is chosen arbitrarily, and repeating the steps with successive approximations gives a process defined for the (j + 1)th step:
$$x_{j+1} \cong x_j - \frac{f(x_j)}{f'(x_j)},$$
so that progressively improved estimates are produced until $f(x_{j+1})$ is sufficiently close to zero. The process exploits the principle that the first terms of

the Taylor series expansion get qualitatively better as the approximation gets closer and the remaining (ignored) steps get progressively less important. Newton’s method converges quadratically in time (number of steps) to a solution provided that the selected starting point is reasonably close to the solution, although the results can be very bad if this condition is not met. The key problem with distant starting points is that if $f'(x)$ changes sign between this starting point and the objective, then the algorithm may even move away from the root (diverge). As a simple example of Newton’s method, suppose that we wanted a numerical routine for finding the square root of a number, µ. This is equivalent to finding the root of the simple equation $f(x) = x^2 - \mu = 0$. The first derivative is just $\frac{\partial}{\partial x} f(x) = 2x$, so the Newton step is $x_{j+1} = x_j - \frac{x_j^2 - \mu}{2x_j} = \frac{1}{2}\left(x_j + \frac{\mu}{x_j}\right)$.

Recall that µ is a constant here defined by the problem and $x_j$ is an arbitrary value at the jth step. A very basic algorithm for implementing this in software or on a hand calculator is

    delta = 0.0000001
    x = starting.value
    DO:
        x.new = 0.5*(x + mu/x)
        x = x.new
    UNTIL: abs(x^2 - mu) < delta

where abs() is the absolute value function and delta is the accuracy threshold that we are willing to accept. If we are interested in getting the square root of 99 and we run this algorithm starting at the obviously wrong point of x = 2, we get:
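The iterates can be regenerated with a short Python version of the routine. This is a sketch rather than the original listing: the function name `newton_raphson`, the `max_steps` safeguard, and the choice to record every iterate are illustrative, and the general form takes the function and its derivative as arguments so that the square-root problem is just one use of it.

```python
def newton_raphson(f, f_prime, x0, tol=1e-7, max_steps=100):
    """Iterate x_{j+1} = x_j - f(x_j)/f'(x_j) until |f(x_j)| < tol.

    Returns the final estimate and the list of all iterates.
    Assumes f_prime is nonzero at every iterate.
    """
    x = x0
    path = [x]
    for _ in range(max_steps):
        if abs(f(x)) < tol:
            break
        x = x - f(x) / f_prime(x)
        path.append(x)
    return x, path


mu = 99.0
root, path = newton_raphson(lambda x: x * x - mu, lambda x: 2.0 * x, x0=2.0)

for j, x_j in enumerate(path):
    print(j, x_j)
```

Starting from 2 the estimate first overshoots badly and then settles quickly on the square root of 99, which illustrates both the sensitivity to the starting point and the rapid convergence once the iterate is near the root.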

6.5 Multidimensional Integrals

As we have previously seen, the integration process measures the area under some function for a given variable. Because functions can obviously have multiple variables, it makes sense to define an integral as measuring the area (volume actually) under a function in more than one dimension. For two variables, the iterated integral (also called the repeated integral) is given in definite form here by
$$V = \int_a^b \int_c^d f(x, y)\,dy\,dx,$$
where x is integrated between constants a and b, and y is integrated between constants c and d. The best way to think of this is by its inherent “nesting” of operations:
$$V = \int_a^b \left[ \int_c^d f(x, y)\,dy \right] dx,$$
so that after the integration with respect to y there is a single integral left with respect to x of some new function that results from the first integration:
$$V = \int_a^b g(x)\,dx,$$
where $g(x) = \int_c^d f(x, y)\,dy$. That is, in the first (inner) step x is treated as a constant, and once the integration with respect to y is done, y no longer appears and the second (outer) step integrates over x alone. In this way each variable is integrated separately with respect to its limits. The idea of integrating under a surface with an iterated integral is illustrated in Figure 6.4, where the black rectangle shows the region and the striped region shows f(x, y). The integration provides the volume between this plane and this surface.

Example 6.4: Double Integral with Constant Limits. In the following

So far all the limits of the integration have been constant. This is not particularly realistic because it assumes that we are integrating functions over rectangles only to measure the volume under some surface described by f(x, y). More generally, the region of integration is a function of x and y. For instance, consider the unit circle centered at the origin defined by the equation $x^2 + y^2 = 1$. The limits of the integral in the x and the y dimensions now depend on the other variable, and an iterated integral over this region measures the cylindrical volume under a surface defined by f(x, y). Thus, we need a means of performing the integration in two steps as before but accounting for this dependency. To generalize the iterated integration process above, we first express the limits of the integral in terms of a single variable. For the circle example we could use either of
$$y = g_y(x) = \sqrt{1 - x^2} \qquad \text{or} \qquad x = g_x(y) = \sqrt{1 - y^2},$$
depending on our order preference. If we pick the first form, then the integral limits in the inner operation are the expression for y in terms of x, so we label this as the function $g_y(x)$.

In the above example we would get identical (and correct) results integrating x first with the limits $(0, \sqrt{1 - x^2})$. This leads to the following general theorem.

Iterated Integral Theorem:
• A two-dimensional area of interest, denoted A, is characterized by either
$$a \le x \le b, \quad g_{y1}(x) \le y \le g_{y2}(x),$$
or
$$c \le y \le d, \quad g_{x1}(y) \le x \le g_{x2}(y).$$
• The function to be integrated, f(x, y), is continuous for all of A.
• Then the double integral over A is equivalent to either of the iterated integrals:
$$\iint_A f(x, y)\,dA = \int_a^b \int_{g_{y1}(x)}^{g_{y2}(x)} f(x, y)\,dy\,dx = \int_c^d \int_{g_{x1}(y)}^{g_{x2}(y)} f(x, y)\,dx\,dy.$$

This theorem states that, like the case with constant limits, we can switch the order of integration if we like, and that in both cases the result is equivalent to the motivating double integral.

Example 6.7: More Complicated Double Integral. Consider the problem of integrating the function $f(x, y) = 3 + x^2$ over the region A, determined by the intersection of the function $f_1(x) = (x - 1)^2$ and the function

(k is again an arbitrary constant in the case of indeﬁnite integrals).
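The Iterated Integral Theorem from this section can be illustrated numerically by nesting a one-dimensional integration rule inside itself. The sketch below is not one of the book's examples: the integrand $f(x, y) = xy^2$, the quarter of the unit disk in the first quadrant as the region, and the midpoint rule are all assumptions made here for illustration. For this choice both orders of integration have the exact value 1/15.

```python
from math import sqrt


def midpoint(g, lo, hi, n=400):
    """Midpoint-rule approximation of the integral of g from lo to hi."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))


def f(x, y):
    """Illustrative integrand (an assumption, not taken from the text)."""
    return x * y**2


# Integrate over y first: the inner limits depend on x through g_y(x) = sqrt(1 - x^2).
dy_then_dx = midpoint(
    lambda x: midpoint(lambda y: f(x, y), 0.0, sqrt(1.0 - x * x)), 0.0, 1.0
)

# Integrate over x first: the inner limits depend on y through g_x(y) = sqrt(1 - y^2).
dx_then_dy = midpoint(
    lambda y: midpoint(lambda x: f(x, y), 0.0, sqrt(1.0 - y * y)), 0.0, 1.0
)

print(dy_then_dx, dx_then_dy, 1.0 / 15.0)
```

The inner call treats the outer variable as a fixed number, exactly as in the nested form of the integral, and the two orders agree up to the accuracy of the rule.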

6.6 Finite and Infinite Series

The idea of finite and infinite series is very important because it underlies many theoretical principles in mathematics, and because some physical phenomena can be modeled or explained through a series. The first distinction that we care about is whether a series converges or diverges. We will also be centrally concerned here with the idea of a limit as discussed in Section 5.2 of Chapter 5.

Example 6.8: The Nazification of German Sociological Genetics. The Hitlerian regime drove out or killed most of the prominent German sociologists of the time (a group that had enormous influence on the discipline). The few remaining German sociologists apparently supported Nazi principles of heredity even though these were wrong and the correct chromosomal theory of inheritance had been published and supported in Germany since 1913 (Hager 1949). The motivation was Hitler’s fascination with Reinrassen (pure races) as opposed to Mischrassen (mixed races), although such distinctions have no scientific basis in genetics whatsoever. These sociologists subscribed to Galton’s (1898) theory that “the two parents, between them, contribute on average one half each inherited faculty,. . . ” and thus a person’s contributed genetics is related to a previous ancestor by the series
$$\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{32}, \ldots$$
(which interestingly sums to 1). This incorrect idea of heredity supported Hitler’s goal of passing laws forbidding mixed marriages (i.e., “Aryan” and “non-Aryan” in this case) because then eventually German society could become 100% Aryan since the fraction above goes to zero in the limit (recall that it was claimed to be a “thousand year Reich”).

Example 6.9: Measuring Small Group Standing. Some sociologists care about the relative standing of individuals in small, bounded groups. This can be thought of as popularity, standing, or esteem, broadly defined. Suppose there are N members of the group and A is the N × N matrix where a 1 for $a_{ij}$ indicates that individual i (the row) chooses to associate with individual j (the column), and 0 indicates the opposite choice. For convenience, the diagonal values are left to be zero (one cannot choose or not choose in this sense). The early literature (Moreno 1934) posited a ranking of status that simply added up choice reception by individual, which can be done by multiplying the A matrix by an appropriate unit vector to sum by columns, for instance,
$$\text{status}_1 = A'u = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \end{bmatrix}' \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \\ 2 \\ 0 \end{bmatrix}.$$

Later it was proposed (Katz 1953) that indirect associations also led to increased standing, so a means of incorporating indirect paths throughout this matrix needed to be included. Of course we would not want to require that indirect paths be equal in weight to direct paths, so we include a weighting factor, 0 < α < 1, that discounts distance. If we include every possible path, then the matrix that keeps track of these is given by the finite matrix series:
$$B = \alpha A + \alpha^2 A'A + \alpha^3 AA'A + \alpha^4 A'AA'A + \ldots + \alpha^N A'A \cdots A'A$$
(assuming N is even) so that the Katz measure of standing for the example above is now $\text{status}_2 = B'u$.
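Both status measures can be computed with plain Python lists. This is a sketch under two stated assumptions: the 4 × 4 matrix A is read from the example's display, and the transposes in the series alternate as A, A′A, AA′A, A′AA′A. The helper functions are illustrative and use no external libraries.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(p, q):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*q)] for row in p]


def column_sums(m):
    """Equivalent to m'u for a unit vector u of ones."""
    return [sum(col) for col in zip(*m)]


# Association matrix: row i chooses column j (assumed reading of the example).
A = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
alpha = 0.5
N = len(A)

status1 = column_sums(A)  # direct choices received by each person

# Build B = alpha*A + alpha^2*A'A + alpha^3*AA'A + ... up to the Nth term.
At = transpose(A)
B = [[0.0] * N for _ in range(N)]
term = A
for power in range(1, N + 1):
    if power > 1:
        left = At if power % 2 == 0 else A  # alternate A' and A on the left
        term = matmul(left, term)
    weight = alpha**power
    B = [[b + weight * t for b, t in zip(b_row, t_row)] for b_row, t_row in zip(B, term)]

status2 = column_sums(B)  # Katz-style standing, B'u
print(status1)
print(status2)
```

Under these assumptions the second and third individuals are tied on the direct measure but separated on the weighted one, with the second individual ahead.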

These calculations obviously used α = 0.5. The final column vector shows relative standing that includes these other paths. This new measure apparently improves the standing of the second person.

6.6.1 Convergence

The key point of a series is that the consecutively produced values are generated by some sort of a rule or relationship. This can be anything from simply adding some amount to the previous value to a complex mathematical form operating on consecutive terms. Notationally, start with an infinite series:
$$S_\infty = \sum_{i=1}^{\infty} x_i = x_1 + x_2 + x_3 + \cdots,$$
which is just a set of enumerated values stretching out to infinity. This is not to say that any one of these values is itself equal to infinity or even that their sum is necessarily infinite, but rather that the quantity of them is infinite. Concurrently, we can also define a finite series of length n:
$$S_n = \sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n,$$
which is a series that terminates with the nth value: $x_n$. This may also simply be the first n values of an infinite series and in this context is called an nth partial sum of the larger infinite series. The difference in subscript of S on the left-hand side emphasizes that the length of these summations differs.

A series is convergent if the limit as n goes to infinity is bounded (noninfinite itself):
$$\lim_{n \to \infty} S_n = A, \quad \text{where } A \text{ is bounded.}$$

A series is divergent if it is not convergent, that is, if A above is positive or negative inﬁnity. Another test is stated in the following theorem.

• If $S_n$ is a series with all positive terms and f(x) is the corresponding function that must be decreasing and everywhere continuous for values of x ≥ 1, then the series $S_n$ and the integral $\int_1^{\infty} f(x)\,dx$ both converge or diverge.

It is important to think about these statements carefully. It is not true that a zero limit of $x_n$ shows convergence (the logic only goes in one direction). For instance, a harmonic series (see the Exercises) has the property that $x_n$ goes to zero in the limit, but it is a well-known divergent series. Convergence is a handy result because it means that the infinite series can be approximated by a reasonable length finite series (i.e., additional values become unimportant at some point). So how does this test work? Let us now evaluate the limiting term of the series:
$$\sum_{i=1}^{\infty} \frac{i - 1}{i + 1} = 0 + \frac{1}{3} + \frac{2}{4} + \frac{3}{5} + \cdots.$$
We could note that the values get larger, which is clearly an indication of a diverging series. The limit of the individual terms also shows this, because
$$\lim_{n \to \infty} \frac{n - 1}{n + 1} = \lim_{n \to \infty} \frac{1 - \frac{1}{n}}{1 + \frac{1}{n}} = 1,$$
which is not zero, indicating divergence of the series. The integral part of the statement above relates the characteristic of a series with an integral, so that if we can obtain convergence of one, we can establish convergence of the other.
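A few partial sums make the one-directional logic concrete. The sketch below is illustrative Python; the helper names and the particular values of n are choices made here. It contrasts the series just evaluated, whose terms approach 1, with the harmonic series, whose terms approach 0 but whose partial sums still keep growing.

```python
def partial_sum(term, n):
    """nth partial sum S_n = term(1) + term(2) + ... + term(n)."""
    return sum(term(i) for i in range(1, n + 1))


def ratio_term(i):
    """Term of the series sum (i - 1)/(i + 1)."""
    return (i - 1) / (i + 1)


def harmonic_term(i):
    """Term of the harmonic series sum 1/i."""
    return 1.0 / i


for n in (10, 100, 1000, 10000):
    print(
        n,
        ratio_term(n),                  # approaches 1, so the series must diverge
        partial_sum(ratio_term, n),
        harmonic_term(n),               # approaches 0 ...
        partial_sum(harmonic_term, n),  # ... yet the partial sums keep growing
    )
```

The harmonic partial sums grow only about as fast as log(n), so a table like this suggests divergence but does not prove it; the proof is left to the Exercises as the text indicates.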

Consider the simple series and associated integral
$$S_\infty = \sum_{i=1}^{\infty} \frac{1}{i^3}, \qquad I_\infty = \int_1^{\infty} \frac{1}{x^3}\,dx.$$
The integral quantity is $\frac{1}{2}$, so we know that the series converges. Here are some famous examples along with their convergence properties. Example 6.10:
∞ i=1

diverges if |r| > 1. The series also diverges for r = 1 since it is then simply

Repeating Values as a Geometric Series. Consider the

$$\frac{123}{1000^1} + \frac{123}{1000^2} + \frac{123}{1000^3} + \cdots$$

which is expressed in the second form as a geometric series with k = 123 and r = 0.001. Clearly this sequence converges because r is (much) less than one. Because it can sometimes be less than obvious whether a series is convergent, a number of additional tests have been developed. The most well known are listed below for the infinite series $\sum_{i=1}^{\infty} a_i$.

• Ratio Test. If every $a_i > 0$ and $\lim_{i\to\infty} \frac{a_{i+1}}{a_i} = A$, then the series converges for $A < 1$, diverges for $A > 1$, and may converge or diverge for $A = 1$.
• Root Test. If every $a_i > 0$ and $\lim_{i\to\infty} (a_i)^{1/i} = A$, then the series converges for $A < 1$, diverges for $A > 1$, and may converge or diverge for $A = 1$.
• Comparison Test. If there is a convergent series $\sum_{i=1}^{\infty} b_i$ and a positive (finite) integer value $J$ such that $a_i \le b_i \;\forall i \ge J$, then $\sum_{i=1}^{\infty} a_i$ converges.
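The ratio and root tests can be explored numerically before attempting a proof. The sketch below is only illustrative: the example series $a_i = i/2^i$, the function names, and the choice of evaluating at i = 500 are assumptions made here, and a value computed at a finite i merely approximates the limit A.

```python
def term(i):
    """Example series term a_i = i / 2**i (chosen here for illustration)."""
    return i / 2**i


def harmonic(i):
    """Harmonic series term 1/i, for which both tests are inconclusive."""
    return 1.0 / i


def ratio_estimate(a, i):
    """Approximate the ratio-test limit A by a_{i+1} / a_i at a large i."""
    return a(i + 1) / a(i)


def root_estimate(a, i):
    """Approximate the root-test limit A by (a_i)**(1/i) at a large i."""
    return a(i) ** (1.0 / i)


# Both estimates are close to 1/2 < 1, consistent with convergence.
print(ratio_estimate(term, 500), root_estimate(term, 500))

# For the harmonic series the ratio approaches 1, so the test says nothing,
# even though the series is known to diverge.
print(ratio_estimate(harmonic, 500))
```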

Some Properties of Convergent Series

Limiting Values: $\lim_{n\to\infty} a_n = 0$ (if $\lim_{n\to\infty} a_n \neq 0$, then $\sum_{i=1}^{\infty} a_i$ diverges)
Summation: $\sum_{i=1}^{\infty} a_i + \sum_{i=1}^{\infty} b_i = \sum_{i=1}^{\infty} (a_i + b_i)$
Scalar Multiplication: $\sum_{i=1}^{\infty} k a_i = k \sum_{i=1}^{\infty} a_i$

Example 6.14: An Equilibrium Point in Simple Games. Consider the basic prisoner's dilemma game, which has many variants, but here two parties obtain 10 dollars each for both cooperating, 15 dollars for acting opportunistically when the other acts cooperatively, and only 5 dollars each for both acting opportunistically. What is the value of this game to a player who intends to act opportunistically at all iterations and expects the other player to do so as well? Furthermore, assume that each player discounts the future value of payoffs by 0.9 per period. Then this player expects a minimum payout of
$$\$5\,(0.9^0 + 0.9^1 + 0.9^2 + 0.9^3 + \cdots + 0.9^{\infty}).$$
The component in parentheses is a geometric series where r = 0.9 < 1, so it converges giving $\$5 \times \frac{1}{1-0.9} = \$50$. Of course the game might be worth slightly more to our player if the opponent was unaware of this strategy on the first or second iteration (presumably it would be quite clear after that).

6.6.1.1 Other Types of Infinite Series

Occasionally there are special characteristics of a given series that allow us to assert convergence. A series where adjacent terms are alternating in sign for the whole series is called an alternating series. An alternating series converges if the same series with absolute value terms also converges. So if $\sum_{i=1}^{\infty} a_i$ is an alternating series, then it converges if $\sum_{i=1}^{\infty} |a_i|$ converges. For instance,

the alternating series given by $\sum_{i=1}^{\infty} \frac{(-1)^{i+1}}{i^2}$ converges if $\sum_{i=1}^{\infty} \frac{1}{i^2}$ converges since the latter is always greater for some given i value. This series converges if the integral is finite:
$$\int_1^{\infty} \frac{1}{x^2}\,dx = -x^{-1}\Big|_1^{\infty} = -\frac{1}{\infty} - \left(-\frac{1}{1}\right) = 1,$$
so the second series converges and thus the original series converges. Another interesting case is the power series, which is a series defined for x of the form
$$\sum_{i=0}^{\infty} a_i x^i .$$
This type of power series has the characteristic that if it converges for the given value of $x_0 \neq 0$, then it converges for $|x| < |x_0|$. Conversely, if the power series diverges at $x_0$, then it also diverges for $|x| > |x_0|$. There are three power series that converge in important ways:
$$\sum_{i=0}^{\infty} \frac{x^i}{i!} = e^x$$
$$\sum_{i=0}^{\infty} \frac{(-1)^i x^{2i+1}}{(2i+1)!} = \sin(x)$$
$$\sum_{i=0}^{\infty} \frac{(-1)^i x^{2i}}{(2i)!} = \cos(x).$$
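These three identities are easy to check numerically by truncating each sum. The sketch below is a hedged illustration only: the function names and the default of 30 terms are assumptions made here, and the sums start at i = 0 so that the constant term of $e^x$ and $\cos(x)$ is included.

```python
import math


def exp_series(x, terms=30):
    """Truncated series sum_{i=0}^{terms-1} x**i / i!."""
    return sum(x**i / math.factorial(i) for i in range(terms))


def sin_series(x, terms=30):
    """Truncated series sum_{i=0}^{terms-1} (-1)**i x**(2i+1) / (2i+1)!."""
    return sum((-1)**i * x**(2 * i + 1) / math.factorial(2 * i + 1)
               for i in range(terms))


def cos_series(x, terms=30):
    """Truncated series sum_{i=0}^{terms-1} (-1)**i x**(2i) / (2i)!."""
    return sum((-1)**i * x**(2 * i) / math.factorial(2 * i)
               for i in range(terms))


x = 1.3  # an arbitrary evaluation point
print(exp_series(x), math.exp(x))
print(sin_series(x), math.sin(x))
print(cos_series(x), math.cos(x))
```

For moderate values of x a few dozen terms already agree with the library functions to many decimal places, which is the practical meaning of convergence here.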

The idea here is bigger than just these special cases (as interesting as they are). It turns out that if a function can be expressed in the form $f(x) = \sum_{i=0}^{\infty} a_i (x - x_0)^i$, then it has derivatives of all orders and $a_i$ can be expressed as the ith derivative divided by the ith factorial. Note that the converse is not

which is just the Taylor series discussed in Section 6.4.2. The trick of course is expressing some function of interest in terms of such a series including the sequence of increasing derivatives. Also, the ability to express a function in this form does not guarantee convergence for particular values of x; that must be proven if warranted. A special case of the Taylor series is the Maclaurin series, which is given when $x_0 = 0$. Many well-known functions can be rewritten as a Maclaurin series. For instance, now express $f(x) = \log(x)$ as a Maclaurin series and compare at x = 2 to x = 1 where f(x) = 0. We first note that
$$f'(x) = \frac{1}{x}, \qquad f''(x) = -\frac{1}{x^2}, \qquad f'''(x) = \frac{2}{x^3}, \qquad f''''(x) = -\frac{6}{x^4}, \;\ldots$$
which leads to the general order form for the derivative
$$f^{(i)}(x) = \frac{(-1)^{i+1}(i-1)!}{x^i}.$$

So the function of interest can be expressed as follows by plugging in the

6.7 The Calculus of Vector and Matrix Forms This last section is more advanced than the rest of the chapter and may be skipped as it is not integral to future chapters. A number of calculus techniques operate or are notated differently enough on matrices and vectors that a separate section is warranted (if only a short one). Sometimes the notation is confusing when one misses the point that derivatives and integrals are operating on these larger, nonscalar structures.

6.7.1 Vector Function Notation

Using standard (Hamiltonian) notation, we start with two orthogonal unit vectors i and j starting at the origin and following along the x-axis and y-axis, respectively. Any vector in two-space ($\mathbb{R}^2$) can be expressed as a scalar-weighted sum of these two basis vectors giving the horizontal and vertical progress: v = ai + bj. So, for example, to characterize the vector from the point (3, 1) to the point (5, 5) we use v = (5 − 3)i + (5 − 1)j = 2i + 4j. Now instead of the scalars a

meaning that for some value of t we have a point on the line. To get the expression for this line in standard slope-intercept form, we first find the slope by getting the ratio of differences (5 − 1)/(5 − 3) = 2 in the standard fashion and subtracting from one of the two points to get the y value where x is zero: (0, −5). Setting x = t, we get y = −5 + 2x. So far this setup has been reasonably simple. Now suppose that we have some curvilinear form in $\mathbb{R}^2$ given the functions $f_1(t)$ and $f_2(t)$, and we would like to get the slope of the tangent line at the point $t_0 = (x_0, y_0)$. This, it turns out, is found by evaluating the ratio of first derivatives of the functions
$$R'(t_0) = \frac{f_2'(t_0)}{f_1'(t_0)},$$
where we have to worry about the restriction that $f_1'(t_0) \neq 0$ for obvious reasons. Why does this work? Consider what we are doing here; the derivatives are producing incremental changes in x and y separately by the construction with i and j above. Because of the limits, this ratio is the instantaneous change in y for a change in x, that is, the slope. Specifically, consider this logic in the notation:
$$\lim_{\Delta t\to 0} \frac{\Delta y}{\Delta x} = \frac{\lim_{\Delta t\to 0} \frac{\Delta y}{\Delta t}}{\lim_{\Delta t\to 0} \frac{\Delta x}{\Delta t}} = \frac{\;\frac{\partial y}{\partial t}\;}{\frac{\partial x}{\partial t}} = \frac{\partial y}{\partial t}\,\frac{\partial t}{\partial x} = \frac{\partial y}{\partial x}.$$

For example, we can find the slope of the tangent line to the curve $x = 3t^3 + 5t^2 + 7$, $y = t^2 - 2$, at t = 1:
$$f_1'(1) = 9t^2 + 10t\,\Big|_{t=1} = 19, \qquad f_2'(1) = 2t\,\Big|_{t=1} = 2, \qquad R'(1) = \frac{2}{19}.$$
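The same slope can be checked numerically. The following sketch is an illustration under stated assumptions: the function names and the step size `h` are choices made here, and the derivatives are approximated by central differences rather than computed analytically.

```python
def f1(t):
    """x-component of the curve: x = 3t^3 + 5t^2 + 7."""
    return 3 * t**3 + 5 * t**2 + 7


def f2(t):
    """y-component of the curve: y = t^2 - 2."""
    return t**2 - 2


def derivative(f, t, h=1e-6):
    """Central-difference approximation to f'(t)."""
    return (f(t + h) - f(t - h)) / (2 * h)


def tangent_slope(t):
    """Slope dy/dx of the parametric curve at t, i.e. f2'(t) / f1'(t)."""
    dx = derivative(f1, t)
    dy = derivative(f2, t)
    if abs(dx) < 1e-9:
        raise ZeroDivisionError("f1'(t) is (numerically) zero at this t")
    return dy / dx


print(tangent_slope(1.0), 2 / 19)  # the two values should agree closely
```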

We can also find all of the horizontal and vertical tangent lines to this curve by a similar calculation. There are vertical tangent lines when $f_1'(t) = 9t^2 + 10t = 0$. Factoring this shows that this happens when t = 0 or $t = -\frac{10}{9}$. Plugging these values back into $x = 3t^3 + 5t^2 + 7$ gives candidate vertical tangents at x = 7 and x = 9.058. There are horizontal tangent lines when $f_2'(t) = 2t = 0$, which occurs only at t = 0, meaning y = −2. Note, however, that at t = 0 both derivatives are zero, so $R'(0)$ is an indeterminate 0/0 form and the point (7, −2) has to be examined separately; only $t = -\frac{10}{9}$ gives an unambiguous vertical tangent.

6.7.2 Differentiation and Integration of a Vector Function

The vector function f(t) is differentiable with domain t if the limit
$$\lim_{\Delta t\to 0} \frac{f(t + \Delta t) - f(t)}{\Delta t}$$

exists and is bounded (finite) for all specified t. This is the same idea we saw for scalar differentiation, except that by consequence $f'(t) = f_1'(t)i + f_2'(t)j$, which means that the function can be differentiated by these orthogonal pieces. It follows also that if $f'(t)$ meets the criteria above, then $f''(t)$ exists, and so on. As a demonstration, let $f(t) = e^{5t}i + \sin(t)j$, so that $f'(t) = 5e^{5t}i + \cos(t)j$. Not surprisingly, integration proceeds piecewise for the vector function just as differentiation was done. For $f(t) = f_1(t)i + f_2(t)j$, the integral is
$$\begin{cases} \displaystyle\int f(t)\,dt = \left[\int f_1(t)\,dt\right] i + \left[\int f_2(t)\,dt\right] j + K & \text{for the indefinite form,} \\[2ex] \displaystyle\int_a^b f(t)\,dt = \left[\int_a^b f_1(t)\,dt\right] i + \left[\int_a^b f_2(t)\,dt\right] j & \text{for the definite form.} \end{cases}$$

Incidentally, we previously saw an arbitrary constant k for indefinite integrals of scalar functions, but that is replaced here with the more appropriate vector-valued form K. This “splitting” of the integration process between the two dimensions can be tremendously helpful in simplifying difficult dimensional problems. Consider the trigonometric function $f(t) = \tan(t)i + \sec^2(t)j$. The integral over [0:π/4] is produced by
$$\int_0^{\pi/4} f(t)\,dt = \left[\int_0^{\pi/4} \tan(t)\,dt\right] i + \left[\int_0^{\pi/4} \sec^2(t)\,dt\right] j .$$
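The componentwise integral of $\tan(t)i + \sec^2(t)j$ over [0:π/4] can be approximated numerically as a check. This is a sketch only: the `simpson` helper and its default of 1000 subintervals are assumptions introduced here, not a method given in the text. The known antiderivatives, $-\log(\cos t)$ and $\tan t$, give the reference values $\frac{1}{2}\log 2$ and 1.

```python
import math


def simpson(f, a, b, n=1000):
    """Composite Simpson's rule for the integral of f over [a, b]."""
    if n % 2:            # Simpson's rule needs an even number of subintervals
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3


def sec_squared(t):
    """sec^2(t) written in terms of cos(t)."""
    return 1.0 / math.cos(t) ** 2


# Integrate each component of f(t) = tan(t) i + sec^2(t) j separately.
i_part = simpson(math.tan, 0.0, math.pi / 4)
j_part = simpson(sec_squared, 0.0, math.pi / 4)

print(i_part, 0.5 * math.log(2))  # coefficient on i
print(j_part, 1.0)                # coefficient on j
```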

Since f(0) is the function value when the components above are zero, except for K we can substitute this for K to complete
$$f(t) = \left(\frac{1}{3}t^3 + 4\right) i + \left(\frac{1}{5}t^5 - 2\right) j.$$

In statistical work in the social sciences, a scalar-valued vector function is important for maximization and description. We will not go into the theoretical derivation of this process (maximum likelihood estimation) but instead will describe the key vector components. Start with a function: y = f (x) = f (x1 , x2 , x3 . . . , xk ) operating on the k-length vector x. The vector of partial

Note that the partial derivatives in the last (most succinct) form are done on vector quantities. This matrix, called the Hessian after its inventor/discoverer, the German mathematician Ludwig Hesse, is square and symmetrical. In the course of normal statistical work it is also positive definite, although serious problems arise if for some reason it is not positive definite because it is necessary to invert the Hessian in many estimation problems.

6.8 Constrained Optimization

This section is considerably more advanced than the previous and need not be covered on the ﬁrst read-through of the text. It is included because constrained optimization is a standard tool in some social science literatures, notably economics. We have already seen a similar example in the example on page 187, where a cost function was minimized subject to two terms depending on committee size. The key feature of these methods is using the ﬁrst derivative to ﬁnd a point where the slope of the tangent line is zero. Usually this is substantively interesting in that it tells us where some x value leads to the greatest possible f (x) value to maximize some quantity of interest: money, utility, productivity, cooperation, and so on. These problems are usually more useful in higher dimensions, for instance, what values of x1 , x2 , and x3 simultaneously provide the greatest value of f (x1 , x2 , x3 )? Now let us revisit the optimization problem but requiring the additional constraint that the values of x1 , x2 , and x3 have to conform to some predetermined relationship. Usually these constraints are expressed as inequalities, say x1 > x2 > x3 , or with speciﬁc equations like x1 + x2 + x3 = 10. The procedure we will use is now called constrained optimization because we will optimize the given function but with the constraints speciﬁed in advance. There is one important underlying principle here. The constrained solution will never be a better solution than the unconstrained solution because we are requiring certain relationships among the terms. At best these will end up being trivial constraints and the two solutions will be identical. Usually, however, the constraints lead to a suboptimal point along the function of interest, and this is done by substantive necessity. Our task will be to maximize a k-dimensional function f (x) subject to the arbitrary constraints expressed as m functions: c1 (x) = r1 , c2 (x) = r2 , . . . , cm (x) = rm ,

where the second form is just a restatement in vector form in which the −r terms are embedded (λ′ denotes a transpose not a derivative). Commonly these $r_1, r_2, \ldots, r_m$ values are zero (as done in the example below), which makes the expression of L(x, λ) cleaner. The λ terms in this expression are called Lagrange multipliers, and this is where the name of the method comes from. Now we take two (multidimensional) partial derivatives and set them equal to zero just as before, except that we need to keep track of λ as well:
$$\frac{d}{dx} L(x, \lambda) = \frac{d}{dx} f(x) + \lambda' \frac{d}{dx} c(x) \equiv 0 \quad\Rightarrow\quad \frac{d}{dx} f(x) = -\lambda' \frac{d}{dx} c(x)$$
$$\frac{d}{d\lambda} L(x, \lambda) = c(x) \equiv 0.$$
The derivative with respect to λ is simple because there are no λ values in the first term. The term $\frac{d}{dx} c(x)$ is just the matrix of partial derivatives of the constraints, and it is commonly abbreviated C. Expressing the constraints in this way also means that the Lagrange multiplier component $-\lambda' \frac{d}{dx} c(x)$ is now just $-\lambda' C$, which is easy to work with. It is interesting to contrast the final part of the first line, $\frac{d}{dx} f(x) = -\lambda' \frac{d}{dx} c(x)$, with unconstrained optimization at this step, $\frac{d}{dx} f(x) = 0$, because it clearly shows the imposition of the constraints on the function maximization. Finally, after taking these derivatives, we solve for the values of x and λ that result. This is our constrained answer. Just about every econometrics book has a numerical example of this process, but it is helpful to have a simple one here. Suppose we have “data” according

which is just the matrix-collected multipliers of the x terms in the constraints since there were no higher order terms to worry about here. This step can be somewhat more involved with more complex (i.e., nonlinear) constraints. With the specified constraints we can now specify the Lagrange multiplier version of the function:
$$L(x, \lambda) = f(x) + \lambda' c(x) = \underbrace{x' \Omega x}_{(1\times 3)(3\times 3)(3\times 1)} - 2 \underbrace{\omega' x}_{(1\times 3)(3\times 1)} + 5 + \underbrace{\lambda' C x}_{(1\times 2)(2\times 3)(3\times 1)} .$$

Note that m = 2 and k = 3 in this example. The next task is to take the two derivatives and set them equal to zero:
$$\frac{d}{dx} L(x, \lambda) = 2x'\Omega - 2\omega' + \lambda' C \equiv 0 \quad\Rightarrow\quad x'\Omega + \frac{1}{2}\lambda' C = \omega'$$
$$\frac{d}{d\lambda} L(x, \lambda) = c(x) = Cx \equiv 0,$$
where c(x) = Cx comes from the simple form of the constraints here. This switch can be much more intricate in more complex specifications of constraints. These final expressions allow us to stack the equations (they are multidimensional anyway) into the following single matrix statement:
$$\begin{bmatrix} \Omega' & \frac{1}{2}C' \\ C & 0 \end{bmatrix} \begin{bmatrix} x \\ \lambda \end{bmatrix} = \begin{bmatrix} \omega \\ 0 \end{bmatrix}$$
where we used the transpose property such that $(\lambda' C)' = C'\lambda$ and $(x'\Omega)' = \Omega' x$ (given on page 116) since we want the column vector ω:
$$\omega' = x'\Omega + \frac{1}{2}\lambda' C$$
$$\omega = \left(x'\Omega + \frac{1}{2}\lambda' C\right)' = \Omega' x + \frac{1}{2} C'\lambda = \begin{bmatrix} \Omega' & \frac{1}{2}C' \end{bmatrix} \begin{bmatrix} x \\ \lambda \end{bmatrix}$$
(the second row is done exactly the same way). This order of multiplication on the left-hand side is essential so that the known quantities are in the matrix and the unknown quantities are in the first vector. If we move the matrix to the right-hand side by multiplying both sides by its inverse (presuming it is nonsingular of course), then all the unknown quantities are expressed by the known quantities. So a solution for [x λ] can now be obtained by matrix

Notice that the λ = 0 here. What this means is that our restriction actually made no impact: The solution above is the unconstrained solution. So we imposed a constraint that would have been satisﬁed anyway. The Lagrange multiplier method is actually more general than is implied here. We used very simple constraints and only a quadratic function. Much more complicated problems can be solved with this approach to constrained optimization.
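To make the stacking step concrete, here is a minimal pure-Python sketch that builds the block system for a quadratic function of the form $x'\Omega x - 2\omega' x + 5$ with linear constraints $Cx = 0$ and solves it by Gaussian elimination. Everything specific in it is an assumption for illustration: the function names, the solver, and especially the small inputs (a 2 × 2 identity Ω, ω = (1, 2)′, and the single constraint $x_1 - x_2 = 0$) are hypothetical and are not the data used in the example above.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations act on both sides at once.
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (or nearly so)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[row][j] -= factor * aug[col][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][j] * x[j] for j in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


def constrained_quadratic_optimum(omega_mat, omega_vec, c_mat):
    """Stack [[Omega', C'/2], [C, 0]] [x; lambda] = [omega; 0] and solve."""
    k, m = len(omega_vec), len(c_mat)
    rows, rhs = [], []
    for i in range(k):   # top block: Omega' x + (1/2) C' lambda = omega
        rows.append([omega_mat[j][i] for j in range(k)]
                    + [0.5 * c_mat[r][i] for r in range(m)])
        rhs.append(omega_vec[i])
    for r in range(m):   # bottom block: C x = 0
        rows.append(list(c_mat[r]) + [0.0] * m)
        rhs.append(0.0)
    solution = solve_linear_system(rows, rhs)
    return solution[:k], solution[k:]


# Hypothetical inputs (NOT the data from the text's example).
omega_mat = [[1.0, 0.0], [0.0, 1.0]]
omega_vec = [1.0, 2.0]
c_mat = [[1.0, -1.0]]    # single constraint: x1 - x2 = 0

x_hat, lam_hat = constrained_quadratic_optimum(omega_mat, omega_vec, c_mat)
print(x_hat, lam_hat)
```

With these made-up inputs the unconstrained optimum would be x = (1, 2), which violates the constraint, so the solver returns x = (1.5, 1.5) with a nonzero multiplier λ = −1. A multiplier of zero, as in the example discussed above, would instead signal that the constraint was not binding.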

To test memory retrieval, Kail and Nippold (1984) asked 8-, 12-, and 21-year-olds to name as many animals and pieces of furniture as possible in separate 7-minute intervals. They found that this number increased across the tested age range but that the rate of retrieval slowed down as the period continued. In fact, the responses often came in “clusters” of related responses (“lion,” “tiger,” “cheetah,” etc.), where the relation of time in seconds to cluster size was fitted to be $cs(t) = at^3 + bt^2 + ct + d$, where time is t, and the others are estimated parameters (which differ by topic, age group, and subject). The researchers were very interested in the inflection point of this function because it suggests a change of cognitive process. Find it for the unknown parameter values analytically by taking the second derivative. Verify that it is an inflection point and not a maximum or minimum. Now graph this function for the points supplied in the authors’ graph of one particular case for an 8-year-old: cs(t) = [1.6, 1.65, 2.15, 2.5, 2.67, 2.85, 3.1, 4.92, 5.55] at the points t = [2, 3, 4, 5, 6, 7, 8, 9, 10]. They do not give parameter values for this case, but plot the function on the same graph for the values a = 0.04291667, b = −0.7725, c = 4.75, and d = −7.3. Do these values appear to satisfy your result for the inflection point?

6.4 For the function $f(x, y) = \frac{\sin(xy)}{\cos(x+y)}$, calculate the partial derivatives with respect to x and y.

6.5 Smirnov and Ershov (1992) chronicled dramatic changes in public opinion during the period of “Perestroika” in the Soviet Union (1985 to 1991). They employed a creative approach by basing their model on the principles of thermodynamics with the idea that sometimes an encapsulated liquid is immobile and dormant and sometimes it becomes turbulent and pressured, literally letting off steam. The catalyst for change is hypothesized to be radical economic reform confronted by conservative counter-reformist policies. Define p as some policy on a metric [−1:1] representing different positions over this range from conservative (p < 0) to liberal (p > 0). The resulting public opinion support, S, is a function that can have single or multiple modes over this range, inflection points and monotonic areas, where the number and variety of these reflect divergent opinions in the population. Smirnov and Ershov found that the most convenient mathematical form here was
$$S(p) = \sum_{i=1}^{4} \lambda_i p^i,$$

where the notation on p indicates exponents and the $\lambda_i$ values are a series of specified scalars. Their claim was that when there are two approximately equal modes (in S(p)), this represents the situation where “the government ceases to represent the majority of the electorate.” Specify $\lambda_i$ values to give this shape; graph over the domain of p; and use the first and second derivatives of S(p) to identify maxima, minima, and inflection points.

6.6 Derive the five partial derivatives for $u_1 \ldots u_5$ from the function on page 236. Show all steps.

6.7 For the function $f(u, v) = \sqrt{u + v^2}$, calculate the partial derivatives with respect to u and v and provide the value of these functions at the point $(\frac{1}{2}, \frac{1}{3})$.

6.8 Using the function $f(x, y, z) = zy^4 - xy^3 + x^3yz^2$, show that
$$\frac{\partial^3}{\partial x\,\partial y\,\partial z} f(x, y, z) = \frac{\partial^3}{\partial z\,\partial y\,\partial x} f(x, y, z).$$

6.9 Obtain the first, second, and third derivatives of the following functions:
$$f(x) = 5x^4 + 3x^3 - 11x^2 + x - 7 \qquad h(z) = 111z^3 - 121z$$
$$f(y) = \sqrt{y} + \frac{1}{y^{7/2}} \qquad f(x) = (x^9)^{-2}$$
$$h(u) = \log(u) + k \qquad g(z) = \sin(z) - \cos(z).$$

6.10 Graph the function given on page 245, the first derivative function, and the second derivative function over [0:4]. Label the three points of interest.

Express a hyperbola with a = 9 and b = 8 in f (t) = f1 (t)i + f2 (t)j notation, and give the slope-intercept forms for the two vertical tangents.

6.21 Given $f(t) = \frac{1}{t} i + \frac{1}{t^3} j$, find the first three orders of derivatives. Solve for t = 2.

6.22 For the function $f(t) = e^{-2t} i + \cos(t) j$, calculate the integral from 1 to 2.

6.23 A number of seemingly counterintuitive voting principles can actually be proven mathematically. For instance, Brams and O’Leary (1970) claimed that “If three kinds of votes are allowed in a voting body, the probability that two randomly selected members disagree on a roll call will be maximized when one-third of the members vote ‘yes,’ one-third ‘no,’ and one third ‘abstain.’” The proof of this statement rests on the premise that their probability of disagreement function is maximized when y = n = a = t/3, where y is the number voting yes, n is the number voting no, a is the number abstaining, and these are assumed to divide equally into the total number of voters t. The disagreement function is given by
$$p(DG) = \frac{2(yn + ya + na)}{(y + n + a)(y + n + a - 1)}.$$

to produce an expression with only β and X terms on the left-hand side and zero on the right-hand side. Show the steps.

7
Probability Theory

7.1 Objectives

We study probability for a variety of reasons. First, probability provides a way of systematically and rigorously treating uncertainty. This is an important idea that actually developed rather late in human history. Despite major contributions from ancient and medieval scholars, the core of what we use today was developed in the seventeenth and eighteenth centuries in continental Europe due to an intense interest in gambling by various nobles and the mathematicians they employed. Key scholars of this period included Pascal, Fermat, Jacob Bernoulli, Johann Bernoulli, de Moivre, and later on Euler, Gauss, Lagrange, Poisson, Laplace, and Legendre. See Stigler (1986, 1999) or Dale (1991) for fascinating accounts of this period. In addition, much of the axiomatic rigor and notation we use today is due to Keynes (1921) and Kolmogorov (1933). Interestingly, humans often think in probabilistic terms (even when not gambling), whether we are conscious of it or not. That is, we decide to cross the street when the probability of being run over by a car is sufficiently low, we go fishing at the lakes where the probability of catching something is sufficiently high, and so on. So, even when people are wholly unfamiliar with the mathematical formalization of probability, there is an inclination to frame uncertain future events in such terms.

Third, probability theory is a precursor to understanding statistics and various ﬁelds of applied mathematics. In fact, probability theory could be described as “mathematical models of uncertain reality” because it supports the use of uncertainty in these ﬁelds. So to study quantitative political methodology, game theory, mathematical sociology, and other related social science subﬁelds, it is important to understand probability theory in rigorous notation. There are actually two interpretations of probability. The idea of subjective probability is individually deﬁned by the conditions under which a person would make a bet or assume a risk in pursuit of some reward. In other words, probability differs by person but becomes apparent in the terms under which a person is willing to wager. Conversely, objective probability is deﬁned as a limiting relative frequency: the long-run behavior of a nondeterministic outcome or just an observed proportion in a population. So objectivity is a function of physical observations over some period of time. In either case, the ideas discussed in this chapter apply equally well to both interpretations of probability.

7.2 Counting Rules and Permutations It seems strange that there could be different and even complicated ways of counting events or contingencies. Minor complexities occur because there are two different features of counting: whether or not the order of occurrence matters, and whether or not events are counted more than once. Thus, in combining these different considerations there are four basic versions of counting rules that are commonly used in mathematical and statistical problems. To begin, observe that the number of ways in which n individual units can be ordered is governed by the use of the factorial function from Chapter 1 (page 37): n(n − 1)(n − 2) · · · (2)(1) = n!.

This makes sense: There are n ways to select the ﬁrst object in an ordered list, n − 1 ways to pick the second, and so on, until we have one item left and there is only one way to pick that one item. For example, consider the set {A, B, C}. There are three (n) ways to pick the ﬁrst item: A, B, or C. Once we have done this, say we picked C to go ﬁrst, then there are two ways (n − 1) to pick the second item: either A or B. After that pick, assume A, then there is only one way to pick the last item (n − 2): B. To continue, how do we organize and consider a range of possible choices given a set of characteristics? That is, if we are selecting from a group of people, we can pick male vs. female, young vs. old, college educated vs. non-college educated, and so on. Notice that we are now thinking about counting objects rather than just ordering objects as done above. So, given a list of known features, we would like a method for enumerating the possibilities when picking from such a population. Fortunately there is a basic and intuitive theorem that guides such counting possibilities. Intuitively, we want to “cross” each possibility from each characteristic to obtain every possible combination.

The Fundamental Theorem of Counting:
• If there are k distinct decision stages to an operation or process,
• each with its own $n_i$ number of alternatives (i = 1, . . . , k),
• then there are $\prod_{i=1}^{k} n_i$ possible outcomes.
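A short enumeration makes the theorem tangible. In this sketch the two stages (card suit, and whether a card is a face card) follow the deck-of-cards illustration used in this section; the variable names are assumptions chosen here for readability.

```python
from itertools import product
from math import prod

# Two decision stages with n_1 = 4 and n_2 = 2 alternatives.
suits = ["Diamonds", "Hearts", "Spades", "Clubs"]
face_status = ["face card", "not a face card"]

# "Crossing" the stages enumerates every possible combination.
outcomes = list(product(suits, face_status))

# The theorem says the count is the product of the stage sizes.
count_by_theorem = prod(len(stage) for stage in (suits, face_status))

for outcome in outcomes:
    print(outcome)
print(len(outcomes), count_by_theorem)
```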

What this formal language says is that if we have a speciﬁc number of individual steps, each of which has some set of alternatives, then the total number of alternatives is the product of those at each step. So for 1, 2, . . . , k different characteristics we multiply the corresponding n1 , n2 , . . . , nk number of features. As a simple example, suppose we consider cards in a deck in terms of suit (n1 = 4) and whether they are face cards (n2 = 2). Thus there are 8 possible countable outcomes deﬁned by crossing [Diamonds, Hearts, Spades, Clubs]

In general, though, we are interested in the number of ways to draw a subset from a larger set. So how many five-card poker hands can be drawn from a 52-card deck? How many ways can we configure a committee out of a larger legislature? And so on. As noted, this counting is done along two criteria: with or without tracking the order of selection, and with or without replacing chosen units back into the pool for future selection. In this way, the general forms of choice rules combine ordering with counting. The first, and easiest, method to consider is ordered, with replacement. If we have n objects and we want to pick k < n from them, and replace the choice back into the available set each time, then it should be clear that on each iteration there are always n choices. So by the Fundamental Theorem of Counting, the number of choices is the product of k values of n alternatives: $n \times n \times \cdots \times n = n^k$ (just as if the factorial ordering rule above did not decrement). The second most basic approach is ordered, without replacement. This is where the ordering principle discussed above comes in more obviously. Suppose again we have n objects and we want to pick k < n from them. There are n ways to pick the first object, n − 1 ways to pick the second object, n − 2 ways to pick the third object, and so on until we have k choices. This decrementing of choices differs from the last case because we are not replacing items on each iteration. So the general form of ordered counting, without replacement using the two principles is
$$n \times (n-1) \times (n-2) \times \cdots \times (n-k+1) = \frac{n!}{(n-k)!}.$$
Here the factorial notation saves us a lot of trouble because we can express this list as the ratio of n! to the factorial series that starts with n − k. So the denominator, (n − k)!, strips off the terms from n − k downward in the product. A slightly more complicated, but very common, form is unordered, without replacement. The best way to think of this form is that it is just like ordered without replacement, except that we cannot see the order of picking. For example, if we were picking colored balls out of an urn, then red, white, red is equivalent to red, red, white and white, red, red. Therefore, there are fewer choices by a factor of k! than with ordered, without replacement since there are k! ways to express this redundancy. So we need only to modify the previous form according to
$$\frac{n!}{(n-k)!\,k!} = \binom{n}{k}.$$

Recall that this is the “choose” notation introduced on page 31 in Chapter 1. The abbreviated notation is handy because unordered, without replacement is an extremely common sampling procedure. We can derive a useful generalization of this idea by first observing that
$$\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}$$
(the proof of this property is a chapter exercise). This form suggests successively peeling off k − 1 iterates to form a sum:
$$\binom{n}{k} = \sum_{i=0}^{k} \binom{n-1-i}{k-i}.$$
Another generalization of the choose notation is found by observing that we have so far restricted ourselves to only two subgroups: those chosen and those not chosen. If we instead consider J subgroups labeled $k_1, k_2, \ldots, k_J$ with the property that $\sum_{j=1}^{J} k_j = n$, then we get the more general form
$$\frac{n!}{\prod_{j=1}^{J} k_j!},$$
which can be denoted $\binom{n}{k_1, k_2, \ldots, k_J}$.

The final counting method, unordered, with replacement is terribly unintuitive. The best way to think of this is that unordered, without replacement needs to be adjusted upward to reflect the increased number of choices. This form is best expressed again using choose notation:
$$\frac{(n+k-1)!}{(n-1)!\,k!} = \binom{n+k-1}{k}.$$

Example 7.1: Survey Sampling. Suppose we want to perform a small survey with 15 respondents from a population of 150. How different are our choices with each counting rule? The answer is, quite different:

Ordered, with replacement: $n^k = 150^{15} = 4.378939 \times 10^{32}$
Ordered, without replacement: $\frac{n!}{(n-k)!} = \frac{150!}{135!} = 2.123561 \times 10^{32}$
Unordered, without replacement: $\binom{n}{k} = \binom{150}{15} = 1.623922 \times 10^{20}$
Unordered, with replacement: $\binom{n+k-1}{k} = \binom{164}{15} = 6.59974 \times 10^{20}$

So, even though this seems like quite a small survey, there is a wide range of sampling outcomes which can be obtained.
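The four counts in this example can be reproduced with exact integer arithmetic. The sketch below assumes Python 3.8 or later, where `math.comb` and `math.perm` are available; the variable names are chosen here for illustration.

```python
from math import comb, perm

n, k = 150, 15  # population size and sample size from the survey example

ordered_with = n ** k                  # ordered, with replacement
ordered_without = perm(n, k)           # ordered, without replacement: n!/(n-k)!
unordered_without = comb(n, k)         # unordered, without replacement
unordered_with = comb(n + k - 1, k)    # unordered, with replacement

for label, value in [
    ("Ordered, with replacement", ordered_with),
    ("Ordered, without replacement", ordered_without),
    ("Unordered, without replacement", unordered_without),
    ("Unordered, with replacement", unordered_with),
]:
    # Exact integers are shown in scientific notation for readability.
    print(f"{label:<32s} {value:.6e}")
```

Because Python integers have arbitrary precision, the values are computed exactly and only rounded when printed.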

7.2.1 The Binomial Theorem and Pascal’s Triangle

The most common mathematical use for the choose notation is in the following theorem, which relates exponentiation with counting.

Binomial Theorem:
• Given any real numbers X and Y and a nonnegative integer n,
$$(X + Y)^n = \sum_{k=0}^{n} \binom{n}{k} X^k Y^{n-k}.$$

which gives a handy form for summarizing binomial expansions (it can obviously go on further than shown here). There are many interesting features of

Pascal’s Triangle. Any value in the table is the sum of the two values diagonally above. For instance, 10 in the third cell of the bottom row is the sum of the 4 and 6 diagonally above. The sum of the kth row (counting the first row as the zero row) can be calculated by $\sum_{j=0}^{k} \binom{k}{j} = 2^k$. The sums of the diagonals from left to right, {1}, {1}, {1, 1}, {1, 2}, {1, 3, 1}, {1, 4, 3}, . . . , give the Fibonacci numbers (1, 1, 2, 3, 5, 8, . . . ). If the first element in a row after the 1 is a prime number, then every number in that row is divisible by it (except the leading and trailing 1’s). If a row is treated as consecutive digits in a larger number (carrying multidigit numbers over to the left), then each row is a power of 11:
$$1 = 11^0, \quad 11 = 11^1, \quad 121 = 11^2, \quad 1331 = 11^3, \quad 14641 = 11^4, \quad 161051 = 11^5,$$
and these are called the “magic 11’s.” There are actually many more mathematical properties lurking in Pascal’s Triangle, but these are some of the more famous.
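These properties of Pascal’s Triangle are simple to verify by computation. The following sketch is illustrative only; the helper names are assumptions introduced here, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def pascal_row(k):
    """Row k of Pascal's Triangle (the top row is row zero)."""
    return [comb(k, j) for j in range(k + 1)]


def shallow_diagonal_sum(d):
    """Sum of the d-th shallow diagonal: C(d, 0) + C(d-1, 1) + C(d-2, 2) + ..."""
    return sum(comb(d - j, j) for j in range(d // 2 + 1))


def row_as_number(k):
    """Read row k as digits, carrying multidigit entries to the left."""
    return sum(c * 10 ** (k - j) for j, c in enumerate(pascal_row(k)))


for k in range(6):
    print(pascal_row(k), sum(pascal_row(k)), row_as_number(k))

# Diagonal sums reproduce the start of the Fibonacci sequence.
print([shallow_diagonal_sum(d) for d in range(8)])

# Row 7 starts with the prime 7 after the leading 1, so its interior
# entries are all divisible by 7.
print([c % 7 for c in pascal_row(7)[1:-1]])
```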

7.3 Sets and Operations on Sets

Sets are holding places. A set is a bounded collection defined by its contents (or even by its lack thereof) and is usually denoted with curly braces. So the set of even positive integers less than 10 is {2, 4, 6, 8}. We can also define sets without necessarily listing all the contents if there is some criterion that defines the contents. For example,
$$\{X : 0 \le X \le 10,\ X \in \mathbb{R}\}$$

deﬁnes the set of all the real numbers between zero and 10 inclusive. We can read this statement as “the set that contains all values labeled X such that X is greater than or equal to zero, less than or equal to 10, and part of the real numbers.” Clearly sets with an inﬁnite number of members need to be described in this fashion rather than listed out as above. The “things” that are contained within a set are called elements, and these can be individual units or multiple units. An event is any collection of possible outcomes of an experiment, that is, any subset of the full set of possibilities, including the full set itself (actually “event” and “outcome” are used synonymously). So {H} and {T } are outcomes for a coin ﬂipping experiment, as is {H, T }. Events and sets are typically, but not necessarily, labeled with capital Roman letters: A, B, T , etc. Events can be abstract in the sense that they may have not yet happened but are imagined, or outcomes can be concrete in that they are observed: “A occurs.” Events are also deﬁned for more than one individual subelement (odd numbers on a die, hearts out of a deck of cards, etc.). Such deﬁned groupings of individual elements constitute an event in the most general sense. Example 7.2: A Single Die. Throw a single die. The event that an even

number appears is the set A = {2, 4, 6}. Events can also be referred to when they do not happen. For the example above we can say “if the outcome of the die is a 3, then A did not occur.”

7.3.1 General Characteristics of Sets

Suppose we conduct some experiment, not in the stereotypical laboratory sense, but in the sense that we roll a die, toss a coin, or spin a pointer. It is useful to have some way of describing not only a single observed outcome, but also the full list of possible outcomes. This motivates the following set definition. The sample space S of a given experiment is the set that consists of all possible outcomes (events) from this experiment. Thus the sample space from flipping a coin is {H, T} (provided that we preclude the possibility that the coin lands on its edge as in the well-known Twilight Zone episode). Sets have different characteristics such as countability and finiteness. A countable set is one whose elements can be placed in one-to-one correspondence with the positive integers (or a subset of them). A finite set has a noninfinite number of contained events. Countability and finiteness (or their opposites) are not contradictory characteristics, as the following examples show.

Example 7.3: Countably Finite Set. A single throw of a die is a countably finite set, S = {1, 2, 3, 4, 5, 6}.

Note that however we deﬁne our sample space here, that deﬁnition does not affect the probabilistic behavior of the dice. That is, they are not responsive in that they do not change physical behavior due to the game being played. Example 7.5: Countably Inﬁnite Set. The number of coin ﬂips until two

angle in radians. Given a hypothetically infinite precision measuring instrument, this is an uncountably infinite set: S = [0:2π). We can also define the cardinality of a set, which is just the number of elements in the set. The finite set A has cardinality given by n(A), $\bar{A}$, or $\bar{\bar{A}}$, where the first form is preferred. Obviously for finite sets the cardinality is an integer value denoting the quantity of events (exclusive of the null set). There are, unfortunately, several ways that the cardinality of a nonfinite set is denoted. The cardinality of a countably infinite set is denoted by ℵ0 (the Hebrew aleph character with subscript zero), and the cardinality of an uncountably infinite set is denoted similarly by ℵ1.

7.3.2 A Special Set: The Empty Set

One particular kind of set is worth discussing at length because it can seem confusing when encountered for the first time. The empty set, or null set, is a set with no elements, as the names imply. This seems a little paradoxical since if there is nothing in the set, should not the set simply go away? Actually, we need the idea of an empty set to describe certain events that do not exist, and therefore the empty set is a convenient thing to have around. Usually the empty set is denoted with the Greek letter phi: φ. An analogy is helpful here. We can think of a set as a suitcase and the elements in the set as contents like clothes and books. Therefore we can define various events for this set, such as the suitcase has all shirts in it, or some similar statement. Now we take these items out of the suitcase one at a time. When there is only one item left in the set, the set is called a singleton. When this last item is removed the suitcase still exists, despite being empty, and it is also available to be filled up again. Thus the suitcase is much like a set and can contain some number of items or simply be empty but still defined. It should be clear, however, that this analogy breaks down in the presence of infinite sets.

7.3.3 Operations on Sets

We can perform basic operations on sets that define new sets or provide arithmetic and boolean (true/false) results. The first idea here is the notion of containment, which specifies that a set is composed entirely of elements of another set. Set A is a subset of set B if every element of A is also an element of B. We also say that A is contained in B and denote this as A ⊂ B or B ⊃ A. Formally, A ⊂ B ⇐⇒ ∀X ∈ A, X ∈ B, which reads “A is a subset of B if and only if all values X that are in A are also in B.” The set A here is a proper subset of B if it meets this criterion and A ≠ B. Some authors distinguish proper subsets from the more general kind where equality is allowed by using ⊂ to denote only proper subsets and ⊆ to denote the more general kind. Unfortunately this notation is not universal. Subset notation is handy in many ways. We just talked about two sets being equal, which intuitively means that they must contain exactly the same elements. To formally assert that two sets are equal we need to claim, however, that both A ⊂ B and B ⊂ A are true so that the contents of A exactly match the contents of B: A = B ⇐⇒ A ⊂ B and B ⊂ A.

Sets can be “unioned,” meaning that they can be combined to create a set that is the same size or larger. Speciﬁcally, the union of the sets A and B, A ∪ B, is the new set that contains all of the elements that belong to either A or B. The key word in this deﬁnition is “or,” indicating that the new set is inclusive. The union of A and B is the set of elements X whereby A ∪ B = {X :X ∈ A or X ∈ B}.

The union operator is certainly not conﬁned to two sets, and we can use a modiﬁcation of the “∪” operator that resembles a summation operator in its application:
$$\bigcup_{i=1}^{n} A_i = A_1 \cup A_2 \cup \cdots \cup A_n.$$

There is an obvious relationship between unions and subsets: An individual set is always a subset of the new set deﬁned by a union with other sets:
$$A = \bigcup_{i=1}^{n} A_i \implies A_1 \subset A,$$

and this clearly works for other constituent sets besides A1. We can also talk about nested subsets:

$$A_n \uparrow A \implies A_1 \subset A_2 \subset \cdots \subset A_n, \quad \text{where } A = \bigcup_{i=1}^{n} A_i$$

$$A_n \downarrow A \implies A_n \subset A_{n-1} \subset \cdots \subset A_1, \quad \text{where } A = \bigcap_{i=1}^{n} A_i.$$

So, for example, if A1 is the ranking minority member on the House appropriations committee, A2 is the minority party membership on the House appropriations committee, A3 is the minority party membership in the House, A4 is the full House of Representatives, A5 is Congress, and A is the government, then we can say An ↑ A.

We can also define the intersection of sets, which contains only those elements found in both (or all for more than two sets of interest). So A ∩ B is the new set that contains all of the elements that belong to both A and B. Now the key word in this definition is “and,” indicating that the new set is exclusive. So the elements of the intersection do not have the luxury of belonging to one set or the other but must now be a member of both. The intersection of A and B is the set of elements X whereby A ∩ B = {X : X ∈ A and X ∈ B}.

Like the union operator, the intersection operator is not conﬁned to just two sets:
$$\bigcap_{i=1}^{n} A_i = A_1 \cap A_2 \cap \cdots \cap A_n.$$

Sets also define complementary sets by the definition of their existence. The complement of a given set is the set that contains all elements not in the original set. More formally, the complement of A is the set $A^{\complement}$ (sometimes denoted $A'$ or $\bar{A}$) defined by

$$A^{\complement} = \{X : X \notin A\}.$$

A special feature of complementation is the fact that the complement of the null set is the sample space, and vice versa: $\phi^{\complement} = S$ and $S^{\complement} = \phi$. This is interesting because it highlights the roles that these special sets play: The complement of the set with everything has nothing, and the complement of the set with nothing has everything.

Another common operator is the difference operator, which defines which portion of a given set is not a member of the other. The difference of A relative to B is the set of elements X whereby

$$A \setminus B = \{X : X \in A \text{ and } X \notin B\}.$$

The difference operator can also be expressed with intersection and complement notation:

$$A \setminus B = A \cap B^{\complement}.$$

Note that the difference operator as defined here is not symmetric: It is not necessarily true that $A \setminus B = B \setminus A$. There is, however, another version called the symmetric difference that further restricts the resulting set, requiring the operator to apply in both directions. The symmetric difference of A relative to B and B relative to A is the set

$$A \triangle B = \{X : (X \in A \text{ and } X \notin B) \text{ or } (X \in B \text{ and } X \notin A)\}.$$

Because of this symmetry we can also denote the symmetric difference as the union of two “regular” differences:

$$A \triangle B = (A \setminus B) \cup (B \setminus A) = (A \cap B^{\complement}) \cup (B \cap A^{\complement}).$$
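The set operators in this section map directly onto Python's built-in `set` type. The following is a small sketch using a single die as the sample space; the particular events A and B are assumptions chosen for illustration.

```python
# Sample space for one die, with two illustrative events (assumed here).
S = set(range(1, 7))
A = {2, 4, 6}        # an even number appears
B = {4, 5, 6}        # a number greater than three appears

union = A | B                 # elements in A or B
intersection = A & B          # elements in A and B
complement_A = S - A          # complement taken relative to S
difference = A - B            # A \ B
symmetric_difference = A ^ B  # in exactly one of A and B

# A \ B equals A intersected with the complement of B.
difference_identity = difference == A & (S - B)

# The symmetric difference is the union of the two one-sided differences.
symmetric_identity = symmetric_difference == (A - B) | (B - A)

# The complement of the empty set is S, and the complement of S is empty.
null_and_full = (S - set() == S) and (S - S == set())

# Each set is a subset of its union with another set.
subset_of_union = A <= union and B <= union

print(union, intersection, complement_A, difference, symmetric_difference)
```

Python has no notion of a sample space on its own, so the complement must always be computed as a difference from an explicitly defined S.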

Figure 7.1 illustrates set operators using a Venn diagram of three sets where the “universe” of possible outcomes (S) is given by the surrounding box. Venn diagrams are useful tools for describing sets in a two-dimensional graph. The intersection of A and B is the dark region that belongs to both sets, whereas the union of A and B is the lightly shaded region that indicates elements in A or B (including the intersection region). Note that the intersection of A or B with C is φ, since there is no overlap. We could, however, consider the nonempty sets A ∪ C and B ∪ C. The complement of A ∪ B is all of the nonshaded region, including C. Consider the more interesting region $(A \cap B)^{\complement}$. This would be every part of S except the intersection, which could also be expressed as those elements that are in the complement of A or the complement of B, thus ruling out the intersection (one of de Morgan’s Laws; see below). The portion of A

element in the second set is also in the first set:
    Suppose X ∈ (A ∪ B) ∩ (A ∪ C), so X ∈ (A ∪ B) and X ∈ (A ∪ C).
    If X ∈ A, then X ∈ A ∪ (B ∩ C).
    Or if X ∉ A, then X ∈ B and X ∈ C, since it is in the two unions but not in A.
    ∴ X ∈ A ∪ (B ∩ C).
• Since A ∪ (B ∩ C) ⊂ (A ∪ B) ∩ (A ∪ C) and A ∪ (B ∩ C) ⊃ (A ∪ B) ∩ (A ∪ C), then every element in the first set is in the second set, and every element in the second set is in the first set. So the sets must be equal.

The case where two sets do not have an intersection (say A and C in Figure 7.1) is important enough that it has a special name. Two sets A and B are disjoint when their intersection is empty: A ∩ B = φ. This is generalizable as well. The k sets A1, A2, . . . , Ak are pairwise disjoint, also called mutually exclusive, if $A_i \cap A_j = \phi \;\; \forall i \neq j$. In addition, if A1, A2, . . . , Ak are pairwise disjoint and we add the condition that $\bigcup_{i=1}^{k} A_i = S$ (i.e., that they cover the sample space completely), then we say that the A1, A2, . . . , Ak are a partition of the sample space. For instance, the outcomes {1, 2, 3, 4, 5, 6} form a partition of S for throwing a single die because they are pairwise disjoint and nothing else can occur. More formally, A1, A2, . . . , Ak are a partition of S iff
• $A_i \cap A_j = \phi \;\; \forall i \neq j$.
• $\bigcup_{i=1}^{k} A_i = S$.

Example 7.8: Overlapping Group Memberships. Sociologists are often interested in determining connections between distinct social networks. Bonacich (1978) used set theory to explore overlapping group memberships such as sports teams, clubs, and social groups in a high school. The data are given by the following cross-listing of 18 community members and 14 social events they could possibly have attended. An “X” indicated that the

We can see that there are two relatively distinct groups here, with reasonable overlap to complicate things. Counting the social events as sets, we can ask some speciﬁc questions and make some observations. First, observe that only M = N ; they have the same members: {12, 13, 14}. A number of sets are disjoint, such as (A, J), (B, L), (D, M ), and others. Yet, the full group of sets, A : N , clearly does not form a partition due to the many nonempty intersections. In fact, there is no subset of the social events that forms a partition. How do we know this? Consider that either I or K would have to be included in the formed partition because they are the only two that include individuals 17 and 18. The set K lacks individual 16, necessitating inclusion of H or I, but these overlap with K elsewhere. Similarly, I lacks individual 15, but each of the ﬁve sets that include this individual overlap somewhere

of formal models of voting in Gill and Gainous (2002). In approval voting, voters are allowed to vote for (approve of) as many candidates as they want but cannot cast more than one vote for each candidate. Then the candidate with the most of these approval votes wins the election. Obviously this system gives voters a wide range of strategies, and these can be analyzed with a formal (mathematical) model. Given K ≥ 3 candidates, it appears that there are $2^K$ possible strategies, from the counting rules in Section 7.2, but because an abstention has the same net effect as voting for every candidate on the ballot, the actual number of different choices is $2^K - 1$. We can formalize approval voting as follows:
• Let w, x, y, z be the individual candidates from which a group of voters can choose, and let wPx represent a given voter’s strict preference for w over x. A multicandidate strict preference order is denoted wPxPyPz.

, denoted as W, X, Y, Z, . . . , L, and the voter is indifferent among the candidates within any single such subset while still strictly preferring every member of that subset to any of the other candidate subsets lower in the preference ordering. When the number of such subsets, $\ell$, equals 1, the voter is called unconcerned and has no strict preference between any candidates. If $\ell = 2$, then the voter is called dichotomous, trichotomous if $\ell = 3$, and finally multichotomous if $\ell \geq 4$.

If all voters have a dichotomous preference, then an approval voting system always produces an election result that is majority preferred, but when all preferences are not dichotomous, the result can be different. In such cases there are multiple admissible voter strategies, where an admissible strategy is one that conforms to the available options among k alternatives and is not uniformly dominated (preferred in all aspects by the voter) by another alternative. As an example, the preference order wPx with xPy has two admissible sincere strategies where the voter may have given an approval vote for only the top alternative w or for the two top alternatives w, x. Also, with multiple alternatives it is possible for voters to cast insincere (strategic) votes: With wPxPy she prefers candidate w but might select only candidate x to make x close to w without helping y.

For two given subsets A and B, define the union A ∪ B = {a : a ∈ A or a ∈ B}. A subset that contains only candidate w is denoted as {w}, the subset that contains only candidate x is denoted as {x}, the subset containing only candidates w and x is denoted as {w, x}, and so on. A strategy, denoted by S, is defined as voting for some specified set of candidates regardless of actual approval or disapproval. Now consider the following set-based assumptions for a hypothetical voter:
• P: If wPx, then {w}P{w, x}P{x}.
• I: If A ∪ B and B ∪ C are nonempty, and if wIx, xIy, and wIy for all w ∈ A, x ∈ B, y ∈ C, then (A ∪ B)I(B ∪ C).
• M(P) = A1 is the subset of the most-preferred candidates under P, and L(P) = An, the subset of the least-preferred candidates under P.

Suppose we look once again at the voter who has the preference order wPxPyPz, while all other voters have dichotomous preferences, with some being sequentially indifferent (such as wIx and yIz), and some strictly prefer w and x to y and z, while the rest prefer y and z to w and x. Each of the other voters uses their unique admissible strategy, so that the aggregated preference for w is equal to that of x, f(w) = f(x), and the aggregated preference for y is equal to that of z, f(y) = f(z). Now assume that the voter with preference wPxPyPz is convinced that there is at least a one-vote difference between w and y, f(w) ≥ f(y) + 1; therefore, {w, y} is a good strategy for this voter because a vote for w ensures that w will receive at least one more vote than x, and a vote for y ensures that y will receive at least one more vote than z. Therefore, {w, y} ensures that the wPxPyPz voter’s most-preferred candidate gets the most votes and the wPxPyPz voter’s least-preferred candidate gets the fewest votes.
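The count of distinct approval-voting choices mentioned at the start of this example can be reproduced by enumerating ballots as subsets of the candidate set. This is a sketch only: the function names are hypothetical, and treating the empty ballot and the full ballot as one choice is the assumption stated in the text (abstaining has the same net effect as approving everyone).

```python
from itertools import combinations

def approval_ballots(candidates):
    """Every subset of candidates a voter could approve, including the empty ballot."""
    return [
        frozenset(chosen)
        for size in range(len(candidates) + 1)
        for chosen in combinations(candidates, size)
    ]

def distinct_choices(candidates):
    """Collapse the full ballot onto the empty ballot, since both have no net effect."""
    full = frozenset(candidates)
    return {frozenset() if ballot == full else ballot
            for ballot in approval_ballots(candidates)}

candidates = ["w", "x", "y", "z"]           # K = 4 candidates
all_ballots = approval_ballots(candidates)  # 2**K subsets
effective = distinct_choices(candidates)    # 2**K - 1 distinct choices

print(len(all_ballots), len(effective))
```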

7.4 The Probability Function

The idea of a probability function is very basic and very important. It is a mapping from a defined event (or events) onto a metric bounded by zero (it cannot happen) and one (it will happen with absolute certainty). Thus a probability function enables us to discuss various degrees of likelihood of occurrence in a systematic and practical way. Some of the language here is a bit formal, but it is important to discuss probability using the terminology in which it was codified so that we can be precise about specific meanings. A collection of subsets of the sample space S is called a sigma-algebra (also called a sigma-field), and denoted $\mathcal{F}$ (a fancy looking “F”), if it satisfies the following three properties:
(i) Null Set Inclusion. It contains the null set: $\phi \in \mathcal{F}$.
(ii) Closed Under Complementation. If $A \in \mathcal{F}$, then $A^{\complement} \in \mathcal{F}$.
(iii) Closed Under Countable Unions. If $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.

So if A is any identified subset of S, then an associated (minimal size) sigma-algebra is $\mathcal{F} = \{\phi, A, A^{\complement}, S\}$. Why do we have these particular elements? We need φ in there due to the first condition, and we have identified A as a subset. So by the second condition we need S and $A^{\complement}$. Finally, does taking unions of any of these events ever take us out of $\mathcal{F}$? Clearly not, so this is a sigma-algebra. Interestingly enough, so is $\mathcal{F} = \{\phi, A, A^{\complement}, A, A, S, A^{\complement}\}$ because there is no requirement that we not repeat events in a sigma-algebra. But this is not terribly useful, so it is common to specify the minimal size sigma-algebra as we have originally done. In fact such a sigma-algebra has a particular name: a Borel-field. These definitions are of course inherently discrete in measure. They do have corresponding versions over continuous intervals, although the associated mathematics get much more involved [see Billingsley (1995) or Chung (2000) for definitive introductions].

Example 7.10: Single Coin Flip. For this experiment, flip a coin once. This produces

$$S = \{H, T\}, \qquad \mathcal{F} = \{\phi, H, T, (H, T)\}.$$

Given a sample space S and an associated sigma-algebra $\mathcal{F}$, a probability function is a mapping, p, from the domain defined by $\mathcal{F}$ to the interval [0:1]. This is shown in Figure 7.2 for an event labeled A in the sample space S.

It is common to identify an experiment or other probabilistic setup with the triple (also called a probability space or a probability measure space) consisting of (S, F, P ), to fully specify the sample space, sigma-algebra, and probability function applied.
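For a finite sample space, the three sigma-algebra conditions can be checked mechanically. The sketch below is illustrative only: `is_sigma_algebra` is a hypothetical helper, and it relies on the fact that for a finite collection, closure under pairwise unions is enough to give closure under all (countable) unions.

```python
from itertools import combinations

def is_sigma_algebra(sample_space, collection):
    """Check the three sigma-algebra conditions for a finite collection of events."""
    S = frozenset(sample_space)
    F = {frozenset(event) for event in collection}  # repeated events collapse
    # (i) Null set inclusion.
    if frozenset() not in F:
        return False
    # (ii) Closed under complementation.
    if any(S - event not in F for event in F):
        return False
    # (iii) Closed under unions (pairwise suffices for a finite collection).
    return all(a | b in F for a, b in combinations(F, 2))

# Minimal sigma-algebra generated by one event A on a die (assumed illustration).
S = set(range(1, 7))
A = {2, 4, 6}
minimal = [set(), A, S - A, S]

# The coin flip collection from Example 7.10.
coin_space = {"H", "T"}
coin_collection = [set(), {"H"}, {"T"}, {"H", "T"}]

# A collection that omits the complement of A fails condition (ii).
missing_complement = [set(), A, S]

print(is_sigma_algebra(S, minimal),
      is_sigma_algebra(coin_space, coin_collection),
      is_sigma_algebra(S, missing_complement))
```

Converting each event to a `frozenset` and storing them in a Python `set` means that listing an event more than once makes no difference to the check.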

7.5 Calculations with Probabilities

The manipulation of probability functions follows logical and predictable rules. The probability of a union of two sets is no smaller than the probability of an intersection of two sets. These two probabilities are equal if one set is a subset of another. It also makes intuitive sense that subsets have no greater probability

Either of the ﬁrst two rules can also be restated as p(A ∪ B) + p(A ∩ B) = p(A) + p(B), which shows that the intersection is “double-counted” with naive addition. Note also that the probability of the intersection of A and B is also called the joint probability of the two events and denoted p(A, B). We can also now state a key result that is quite useful in these types of calculations.
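The restated rule can be verified on a small example. Here is a minimal sketch for one fair die with equally likely outcomes; the events A and B are assumptions chosen for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

S = frozenset(range(1, 7))   # one fair die (assumed illustration)

def p(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

A = frozenset({2, 4, 6})     # even outcome
B = frozenset({4, 5, 6})     # outcome greater than three

left_side = p(A | B) + p(A & B)   # p(A ∪ B) + p(A ∩ B)
right_side = p(A) + p(B)          # naive addition

# Subtracting the joint probability removes the double-counted intersection.
union_probability = p(A) + p(B) - p(A & B)

print(left_side, right_side, union_probability)
```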

The Theorem of Total Probability:
• Given any events A and B,
• $p(A) = p(A \cap B) + p(A \cap B^{\complement})$.

This intuitively says that the probability of an event A can be decomposed into two parts: one that intersects with another set B and the other that intersects with the complement of B, as shown in Figure 7.3. If there is no intersection or if A is a subset of B, then one of the two parts has probability zero.
Fig. 7.3. Theorem of Total Probability Illustrated
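A short numerical check of this decomposition follows. It is a sketch under the assumption of one fair die with illustrative events; none of these specific sets come from the text.

```python
from fractions import Fraction

S = frozenset(range(1, 7))   # one fair die (assumed illustration)

def p(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

A = frozenset({2, 4, 6})     # even outcome
B = frozenset({4, 5, 6})     # outcome greater than three
B_complement = S - B

part_inside_B = p(A & B)               # the part of A that overlaps B
part_outside_B = p(A & B_complement)   # the part of A that misses B
total = part_inside_B + part_outside_B

# When an event lies entirely inside B, its part outside B has probability zero.
A_inside_B = frozenset({4, 6})
vanishing_part = p(A_inside_B & B_complement)

print(part_inside_B, part_outside_B, total, vanishing_part)
```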

Probability statements can be enormously useful in political science research. Since political actors are rarely deterministic enough to predict with certainty, using probabilities to describe potential events or actions provides a means of making claims that include uncertainty.

Jeffrey Segal (1984) looked at Supreme Court decisions to review search and seizure cases from lower courts. He constructed a model using data from all 123 Fourth Amendment cases from 1962 to 1981 to explain why the Court upheld the lower court ruling versus overturning it. The objective was to make probabilistic statements about Supreme Court decisions given speciﬁc aspects of the case and therefore to make predictive claims about future actions. Since his multivariate statistical model simultaneously incorporates all these variables, the probabilities described are the effects of individual variables holding the effects of all others constant. One of his ﬁrst ﬁndings was that a police search has a 0.85 probability of being upheld by the Court if it took place at the home of another person and only a 0.10 probability of being upheld in the detainee’s own home. This is a dramatic difference in probability terms and reveals considerable information about the thinking of the Court. Another notable difference occurs when the search takes place with no property interest versus a search on the actual person: 0.85 compared to 0.41. Relatedly, a “stop and frisk” search case has a 0.70 probability of being upheld whereas a full personal search has a probability of 0.40 of being upheld. These probabilistic ﬁndings point to an underlying distinction that justices make in terms of the personal context of the search. Segal also found differences with regard to police possession of a warrant or probable cause. A search sanctioned by a warrant had a 0.85 probability of being upheld but only a 0.50 probability in the absence of such prior authority. The probability that the Court would uphold probable cause searches (where the police notice some evidence of illegality) was 0.65, whereas those that were not probable cause searches were upheld by the Court with probability 0.53. 
This is not a great difference, and Segal pointed out that it is confounded with other criteria that affect the overall reasonableness of the search. One such criterion noted is the status of the arrest. If the search is performed subject to a lawful arrest, then there is a (quite impressive) 0.99 probability of being upheld, but only a 0.50 probability if there is no arrest, and all the way down to 0.28 if there is an unlawful arrest. What is impressive and useful about the approach taken in this work is that the author translates extensive case study into probability statements that are intuitive to readers. By making such statements, underlying patterns of judicial thought on Fourth Amendment issues are revealed.

7.6 Conditional Probability and Bayes Law

Conditional probability statements recognize that some prior information bears on the determination of subsequent probabilities. For instance, a candidate’s probability of winning office is almost certain to change if the opponent suffers a major scandal or drops out of the race. We would not want to ignore information that alters probability statements, and conditional probability provides a means of systematically including other information by changing “p(A)” to “p(A|B)” to mean the probability that A occurs given that B has occurred.

Example 7.12: Updating Probability Statements. Suppose a single die is rolled but it cannot be seen. The probability that the upward face is a four is obviously one-sixth, $p(x = 4) = \frac{1}{6}$. Further suppose that you are told that the value is greater than three. Would you revise your probability statement? Obviously it would be appropriate to update since there are now only three possible outcomes, one of which is a four. This gives $p(x = 4 \mid x > 3) = \frac{1}{3}$, which is a substantially different statement.

There is a more formal means of determining conditional probabilities. Given two outcomes A and B in S, the probability that A occurs given that B occurs is the probability that A and B both occur divided by the probability that B occurs:

$$p(A|B) = \frac{p(A \cap B)}{p(B)},$$

provided that $p(B) \neq 0$.
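The formal definition reproduces the informal update from the single-die example. The following is a minimal sketch; `p` and `p_given` are hypothetical helper names, and equally likely outcomes are assumed.

```python
from fractions import Fraction

S = frozenset(range(1, 7))   # faces of a single fair die

def p(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

def p_given(a, b):
    """Conditional probability p(A|B) = p(A ∩ B) / p(B), undefined when p(B) = 0."""
    if p(b) == 0:
        raise ValueError("conditioning event has probability zero")
    return p(a & b) / p(b)

four = frozenset({4})
greater_than_three = frozenset({4, 5, 6})

unconditional = p(four)                    # before any information arrives
updated = p_given(four, greater_than_three)  # after learning that x > 3

print(unconditional, updated)
```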

Example 7.13: Conditional Probability with Dice. In rolling two dice labeled X and Y, we are interested in whether the sum of the up faces is four, given that the die labeled X shows a three. The unconditional probability is given by

$$p(X + Y = 4) = p(\{1, 3\}, \{2, 2\}, \{3, 1\}) = \frac{1}{12},$$

Similarly, for the set $B^{\complement}$, we get $p(A|B^{\complement})p(B^{\complement}) = p(A \cap B^{\complement})$. For any set B we know that A has two components, one that intersects with B and one that does not (although either could be a null set). So the probability of A can be expressed as the sum of conditional probabilities:

$$p(A) = p(A|B)p(B) + p(A|B^{\complement})p(B^{\complement}).$$

Thus the Theorem of Total Probability can also be reexpressed in conditional notation, showing that the probability of any event can be decomposed into conditional statements about any other event. It is possible to further extend this with an additional conditional statement. Suppose now that we are interested in decomposing p(A|C) with regard to another event, B and $B^{\complement}$. We start with the definition of conditional probability, expand via the most basic form of the

$$= p(A|B \cap C)p(B|C) + p(A|B^{\complement} \cap C)p(B^{\complement}|C).$$

It is important to note here that the conditional probability is order-dependent: $p(A|B) \neq p(B|A)$. As an illustration, apparently in California the probability that a highway motorist was in the left-most lane given they subsequently received a speeding ticket is about 0.93. However, it is certainly not true that the probability that one receives a speeding ticket given they are in the left lane is also 0.93 (or this lane would be quite empty!). But can these conditional probabilities be related somehow? We can manipulate the conditional probability statements in parallel:

$$p(A|B) = \frac{p(A \cap B)}{p(B)} \qquad\qquad p(B|A) = \frac{p(B \cap A)}{p(A)}$$

7.6.1 Simpson’s Paradox

Sometimes conditioning on another event actually provides opposite results from what would normally be expected. Suppose, for example, a state initiated a pilot job training program for welfare recipients with the goal of improving skill levels to presumably increase the chances of obtaining employment for these individuals. The investigators assign half of the group to the job placement program and leave the other half out as a control group. The results for the full group and a breakdown by sex are provided in Table 7.1. Looking at the full group, those receiving the job training are somewhat more likely to land employment than those who did not. Yet when we look at these same people divided into men and women, the results are the opposite! Now

it appears that it is better not to participate in the job training program for both sexes. This surprising result is called Simpson’s Paradox. How can something that is good for the full group be bad for all of the component subgroups? A necessary condition for this paradox to arise is for Training and Job to be correlated with each other, and Male to be correlated with both Training and Job. So more men received job training and more men got jobs. In other words, treatment (placement in the program) is confounded with sex. Therefore the full group analysis “masks” the effect of the treatment by aggregating the confounding effect out. This is also called aggregation bias because the average of the group averages is not the average of the full population.

We can also analyze this using the conditional probability version of the Total Probability Theorem. Label the events J for a job, T for training, and M for male. Looking at the table it is easy to observe that p(J|T) = 0.5 since a total of 200 individuals got the training and 100 acquired jobs. Does this comport with the conditioning variable?

$$p(J|T) = p(J|M \cap T)\,p(M|T) + p(J|M^{\complement} \cap T)\,p(M^{\complement}|T) = (0.6)\,\frac{90 + 60}{100 + 100} + (0.2)\,\frac{10 + 40}{100 + 100} = 0.5.$$
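The arithmetic in this decomposition can be reproduced directly. The sketch below uses only the quantities that appear in the equation for the trained group; the variable names are illustrative, and nothing is assumed about table entries not shown in that equation.

```python
from fractions import Fraction

# Quantities taken from the equation for the trained group.
trained_men = 90 + 60        # numerator of p(M | T)
trained_women = 10 + 40      # numerator of p(not M | T)
trained_total = 100 + 100    # everyone who received training

p_job_given_male_trained = Fraction(6, 10)    # p(J | M and T) = 0.6
p_job_given_female_trained = Fraction(2, 10)  # p(J | not M and T) = 0.2

p_male_given_trained = Fraction(trained_men, trained_total)
p_female_given_trained = Fraction(trained_women, trained_total)

# Conditional form of the Theorem of Total Probability.
p_job_given_trained = (
    p_job_given_male_trained * p_male_given_trained
    + p_job_given_female_trained * p_female_given_trained
)

print(p_male_given_trained, p_female_given_trained, p_job_given_trained)
```

The aggregate rate of 0.5 is a weighted average of the subgroup rates, and the weights (0.75 and 0.25) show how heavily the trained group is tilted toward men.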

7.7 Independence

In the last section we found that certain events will change the probability of other events and that we are advised to use the conditional probability statement as a way of updating information about a probability of interest. Suppose that the first event does not change the probability of the second event. For example, if we observe that someone drives a blue car, it does not change the probability that they will vote for the Republican candidate in the next election. Conversely, if we knew that this person voted for the Republican candidate in the last election, we would certainly want to update our unconditional probability. So how do we treat the first case when it does not change the subsequent probability of interest? If all we are interested in is the probability of voting Republican in the next election, then it is obviously reasonable to ignore the information about car color and continue to use whatever probability we had originally assigned. But suppose we are interested in the probability that an individual votes Republican (event A) and owns a blue car (event B)? This joint probability is just the product of the unconditional probabilities, and we say that A and B are independent if p(A ∩ B) = p(A)p(B). So the subject’s probability of voting for the Republican and driving a blue car is just the probability that she votes for the Republican times the probability that she owns a blue car. Put another way, the intersection occurs by chance, not by some dependent process. The idea of independence can be generalized to more than two events. A set of events A1, A2, . . . , Ak is pairwise independent if

$$p(A_i \cap A_j) = p(A_i)\,p(A_j) \quad \forall i \neq j.$$

This means that if we pick any two events out of the set, they are independent of each other. Pairwise independence does not mean the same thing as general independence, though it is a property now attached to the pairing operation. As an example, Romano and Siegel (1986) give the following three events for two tosses of a fair coin:

• Event A: Heads appears on the first toss.
• Event B: Heads appears on the second toss.
• Event C: Exactly one heads appears in the two tosses.

It is clear that each event here has probability of $\frac{1}{2}$. Also we can ascertain that they are each pairwise independent:

$$
\begin{aligned}
p(A \cap B) &= \tfrac{1}{4} = p(A)p(B) = \left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right) \\
p(B \cap C) &= \tfrac{1}{4} = p(B)p(C) = \left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right) \\
p(A \cap C) &= \tfrac{1}{4} = p(A)p(C) = \left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right),
\end{aligned}
$$

but they are not independent as a group because

$$
p(A \cap B \cap C) = 0 \neq p(A)p(B)p(C) = \tfrac{1}{8}.
$$

So independence is a property that changes with group constituency. In addition, independence can be conditional on a third event. Events A and B are conditionally independent on event C if

$$
p(A \cap B \mid C) = p(A \mid C)\,p(B \mid C).
$$

Returning to the example above, A and B are not conditionally independent either if the condition is C because $p(A \cap B \mid C) = 0$, but

$$
p(A \mid C) = \tfrac{1}{2}, \qquad p(B \mid C) = \tfrac{1}{2},
$$

and their product is clearly not zero.
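The coin-toss example can be checked by brute-force enumeration. The following is a minimal sketch in Python; the helper names `prob` and `both` are illustrative choices and not part of the original example.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin; each outcome has probability 1/4.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event, given as a true/false test on an outcome."""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))

def both(*events):
    """Intersection of several events."""
    return lambda o: all(e(o) for e in events)

A = lambda o: o[0] == "H"          # heads on the first toss
B = lambda o: o[1] == "H"          # heads on the second toss
C = lambda o: o.count("H") == 1    # exactly one heads in the two tosses

# Pairwise independence: every pair satisfies the product rule.
pairwise = all(
    prob(both(E, F)) == prob(E) * prob(F)
    for E, F in [(A, B), (B, C), (A, C)]
)

# Independence as a group fails: the triple intersection is empty.
mutual = prob(both(A, B, C)) == prob(A) * prob(B) * prob(C)

# Conditional probabilities given C.
p_AB_given_C = prob(both(A, B, C)) / prob(C)
p_A_given_C = prob(both(A, C)) / prob(C)
p_B_given_C = prob(both(B, C)) / prob(C)

print(pairwise, mutual)                          # True False
print(p_AB_given_C, p_A_given_C * p_B_given_C)   # 0 1/4
```

Exact fractions are used so that the equality checks are not disturbed by floating-point rounding.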

An important theorem states that if A and B are independent, then functions of A and B operating on the same domain are also independent. As an example, suppose we can generate random (equally likely) integers from 1 to 20 (usually done on computers). Define A as the event that a prime number occurs except the prime number 2 (with 1 counted among the primes here):

$$
p(A) = p(x \in \{1, 3, 5, 7, 11, 13, 17, 19\}) = \frac{8}{20},
$$

and B as the event that the number is greater than 10:

$$
p(B) = p(x > 10) = \frac{10}{20}.
$$
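As a quick check of these two events, the sketch below enumerates the integers from 1 to 20 and compares p(A ∩ B) with p(A)p(B). It uses the set exactly as printed in the text, and the variable names are illustrative.

```python
from fractions import Fraction

numbers = range(1, 21)                    # equally likely integers 1..20
A = {1, 3, 5, 7, 11, 13, 17, 19}          # the set given in the text
B = {x for x in numbers if x > 10}        # numbers greater than 10

p_A = Fraction(len(A), 20)
p_B = Fraction(len(B), 20)
p_AB = Fraction(len(A & B), 20)           # A and B both occur

print(p_A, p_B)          # 2/5 1/2
print(p_AB, p_A * p_B)   # 1/5 1/5
```

Since the joint probability equals the product of the two unconditional probabilities, A and B satisfy the definition of independence.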

looking at conditional probabilities is with game or decision trees (depending on the specific application). Bueno de Mesquita, Newman, and Rabushka (1985) suggested this method for forecasting political events and applied it to the test case of Hong Kong’s reversion to China (PRC). The decision tree in Figure 7.4 shows the possible decisions and results for a third party’s decision to support either the challenger or the government in an election, given that they have ruled out doing nothing. Suppose we were trying to anticipate the behavior of this party and the resulting impact on the election (presumably the support of this party matters).

Hypothetical probabilities of each event at the nodes are given in the figure. For instance, the probability that the challenger wins is 0.03 and the probability that the challenger loses is 0.97, after the third party has already thrown its support behind the government. Correspondingly, the probability that the challenger wins is 0.20 and the probability that the challenger loses is 0.80, after the third party has already thrown its support behind the challenger. So these are conditional probabilities exactly as we have studied before, but now shown diagrammatically. In standard notation these are

$$
\begin{aligned}
p(C \mid SC) &= 0.20 & p(G \mid SC) &= 0.80 \\
p(C \mid SG) &= 0.03 & p(G \mid SG) &= 0.97,
\end{aligned}
$$

where we denote C as challenger wins, G as government wins, SG for the third party supports the government, and SC for the third party supports the challenger. As an analyst of future Hong Kong elections, one might be estimating the probability of either action on the part of the third party, and these are given here as p(SC) = 0.65 and p(SG) = 0.35. In other words, our study indicates that this party is somewhat more inclined to support the opposition. So what does this mean for predicting the eventual outcome?


We have to multiply the probability of getting to the ﬁrst node of interest times the probability of getting to the outcome of interest, and we have to do this for the entire tree to get the full picture. Looking at the ﬁrst (top) path of the tree, the probability that the challenger wins when the third party supports them is (0.20)(0.65) = 0.13. Conceptually we can rewrite this in the form p(C|SC)p(SC), and we know that this is really p(C, SC) = p(C|SC)p(SC) from the deﬁnition of conditional probability on page 312. So the probabilities at the nodes in the tree ﬁgure are only “local” in the sense that they correspond to events that can happen at that geographic point only since they assume that the tree has been traversed already to that point. This makes a nice point about conditional probability. As we condition on events we are literally walking down a tree of possibilities, and it is easy to see that such trees can be much wider (more decisions at each node) and much deeper (more steps to the ﬁnal event of interest).
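The path multiplication described above can be sketched in a few lines of Python. The probabilities are the hypothetical ones from the example; the dictionary layout and variable names are assumptions made for illustration. Summing the joint probabilities over the paths that end in a challenger win (the Theorem of Total Probability) is an added step that gives the unconditional probability p(C).

```python
# Probabilities of the third party's action (from the example).
p_support = {"SC": 0.65, "SG": 0.35}

# Conditional probabilities of the election outcome given that action.
p_outcome_given = {
    "SC": {"C": 0.20, "G": 0.80},
    "SG": {"C": 0.03, "G": 0.97},
}

# Joint probability of each complete path through the tree:
# p(outcome, action) = p(outcome | action) * p(action).
joint = {
    (outcome, action): p_outcome_given[action][outcome] * p_support[action]
    for action in p_support
    for outcome in p_outcome_given[action]
}

# Unconditional probability that the challenger wins, summed over both paths.
p_challenger = sum(p for (outcome, _), p in joint.items() if outcome == "C")

for path, p in joint.items():
    print(path, round(p, 4))
print("p(C) =", round(p_challenger, 4))
```

The four joint probabilities sum to one, which is a useful sanity check that the whole tree has been traversed.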

7.8 Odds

Sometimes probability statements are reexpressed as odds. In some academic fields this is routine and researchers view these to be more intuitive than standard probability statements. To simplify for the moment, consider a sample space with only two outcomes: success and failure. These can be defined for any social event we like: wars, marriages, group formations, crimes, and so on. Defining the probability of success as p = p(S) and the probability of failure as q = 1 − p = p(F), the odds of success are

$$
\text{odds}(S) = \frac{p}{q}.
$$

Notice that while the probability metric is confined to [0:1], odds are positive but unbounded. Often odds are given as integer comparisons, as in “The odds of success are 3 to 2,” notated 3 : 2, and if it is convenient, making the second number 1 is particularly intuitive. Converting probabilities to odds does not lose any information, and the probability can be recovered. For instance, odds of 3 : 2 correspond to p = 3/(3 + 2) = 0.6 and q = 0.4.
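As a sketch of this two-way conversion for the two-outcome case, the functions below map a probability to odds and back. The function names are illustrative; the inverse formula p = odds/(1 + odds) follows from solving odds = p/(1 − p) for p.

```python
from fractions import Fraction

def prob_to_odds(p):
    """Odds of success, p / (1 - p), for a probability p < 1."""
    return p / (1 - p)

def odds_to_prob(odds):
    """Recover the probability from the odds: p = odds / (1 + odds)."""
    return odds / (1 + odds)

p = Fraction(3, 5)
odds = prob_to_odds(p)

print(odds)                 # 3/2, that is, odds of 3 to 2
print(odds_to_prob(odds))   # 3/5
```

The round trip returns the original probability, which illustrates that no information is lost in the conversion.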

These calculations are more involved but essentially the same for more than two possible outcomes.

Example 7.15: Parental Involvement for Black Grandmothers. Pearson et al. (1990) researched the notion that black grandparents, typically grandmothers, living in the same household are more active in parenting their grandchildren than their white counterparts. The authors were concerned with testing differences in extended family systems and the roles that members play in child rearing. They obtained data on 130 black families where the grandmother lived in the house and with reported levels of direct parenting for the grandchildren. Three dichotomous (yes/no) effects were of direct interest here. Supportive behavior was defined as reading bedtime stories, playing games, or doing a pleasant outing with the child. This was the main variable of interest to the researchers. The first supporting variable was punishment behavior, which was whether or not the grandmother punished the child on misbehavior. The second supporting variable was controlling behavior, which meant that the grandmother established the rules of behavior for the child. Pearson et al. looked at a wide range of explanations for differing levels of grandmother involvement, but the two most interesting findings related to these variables. Grandmothers who took the punishment behavior role versus not taking on this role had an odds ratio of 2.99 : 1 for exhibiting supportive behavior. Furthermore, grandmothers who took the controlling behavior role versus not doing so had an odds ratio of 5.38 : 1 for exhibiting supportive behavior. Therefore authoritarian behavior strongly predicts positive parenting interactions.
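An odds ratio compares the odds of an outcome in one group with the odds of the same outcome in another group. The sketch below shows the arithmetic with purely hypothetical probabilities; these numbers are NOT taken from Pearson et al. (1990), whose underlying counts are not given here.

```python
from fractions import Fraction

def odds(p):
    """Odds of an outcome that has probability p."""
    return p / (1 - p)

def odds_ratio(p_group1, p_group2):
    """Odds of the outcome in group 1 relative to the odds in group 2."""
    return odds(p_group1) / odds(p_group2)

# Hypothetical values for illustration only (not from the study):
p_supportive_if_punishes = Fraction(3, 4)   # odds of 3 to 1
p_supportive_if_not = Fraction(1, 2)        # odds of 1 to 1

print(odds_ratio(p_supportive_if_punishes, p_supportive_if_not))  # 3
```

An odds ratio of 1 would indicate that the two groups have the same odds; values above 1 indicate higher odds in the first group.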

Exercises

7.1 A fair coin is tossed 20 times and produces 20 heads. What is the probability that it will give a tails on the 21st try?

7.2 At the end of a legislative session there is time to vote on only three more bills. Pending there are 12 bills total: 6 on foreign policy, 4 on judicial affairs, and 2 on energy policy. Given equally likely selection, what is the probability that
• exactly one foreign policy bill will receive a vote?
• all three votes will be on foreign policy?
• one of each type will receive a vote?
• no judicial affairs bills will receive a vote?

7.3 Prove that

$$
\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}.
$$

7.4 Develop two more rows to the Pascal’s Triangle given on page 290 and show that the “magic 11” property holds.

7.5 Suppose you had a pair of four-sided dice (they exist), so the set of possible outcomes from summing the results from a single toss is {2, 3, 4, 5, 6, 7, 8}. Determine the probability of each of these outcomes.

7.6 For some set A, explain A ∪ A and A ∩ A.

7.7 Prove de Morgan’s Laws for two sets A and B.

7.8 The probability that marriage A lasts 20 years is 0.4, the probability that marriage B lasts 20 years is 0.25, and the probability that marriage C lasts 20 years is 0.8. Under the assumption of independence, calculate the probability that they all last 20 years, the probability that none of them last 20 years, and the probability that at least one lasts 20 years.

7.9 If p(D|H) = 0.5, p(D) = 1, and p(H) = 0.1, what is the probability that H is true given D?

7.10 In rolling two dice labeled X and Y, what is the probability that the sum of the up faces is four, given that either X or Y shows a three?

7.11 Show that the Theorem of Total Probability also works when either of the two sets is the null set.

7.12 You are trying to form a coalition cabinet from the six major Italian political parties (given by just initials here). There are three senior members of DC, five senior members of the PCI, four senior members of PSI, two senior members of PSDI, five senior members of PRI, and three senior members of PLI, all vying for positions in the cabinet. How many ways could you choose a cabinet composed of two from each party?

7.13 If events A and B are independent, prove that $A^{c}$ and $B^{c}$ are also independent. Can you say that A and $A^{c}$ are independent? Show your logic.

7.14 Suppose we roll a single die three times. What is the probability of:
(a) three sixes?
(b) exactly one six?
(c) the sum of the three rolls is 4?
(d) the sum of the three rolls is a prime number?

Al and George want to have a “town hall” style debate. There are only 100 undecided voters in the entire country from which to choose an audience. If they want 90 of these people, how many different sets of 90 can be chosen (unordered, without replacement)?

7.17 Someone claims they can identify four different brands of beer by taste. An experiment is set up (off campus of course) to test her ability in which she is given each of the four beers one at a time without labels or any visual identification.
(a) How many different ways can the four beers be presented to her one at a time?
(b) What is the probability that she will correctly identify all four brands simply by guessing?
(c) What is the probability that she will incorrectly identify only one beer simply by guessing (assume she does not duplicate an answer)?
(d) Is the event that she correctly identifies the second beer disjoint with the event that she incorrectly identifies the fourth beer?

7.18 A company has just placed an order with a supplier for two different products. Let
E = the event that the first product is out of stock
F = the event that the second product is out of stock
Suppose that p(E) = 0.3, p(F) = 0.2, and the probability that at least one is out of stock is 0.4.
(a) What is the probability that both are out of stock?
(b) Are E and F independent events?
(c) Given that the first product is in stock, what is the probability that the second is also?

7.19 Suppose your professor of political theory put 17 books on reserve in the library. Of these, 9 were written by Greek philosophers and the rest were written by German philosophers. You have already read all of the Greeks, but none of the Germans, and you have to ask for the books one at a time. Assuming you left the syllabus at home, and you have to ask for the books at random (equally likely) by call letters:
(a) What is the probability that you have to ask for at least three books before getting a German philosopher?
(b) What is the highest possible number of times you would have to ask for a book before receiving a German philosopher?

7.20 Suppose you first flipped a quarter, then flipped a dime, and then flipped a nickel.
(a) What is the probability of getting a heads on the nickel given you get tails on the quarter and heads on the dime?
(b) Are the events getting a tails on the quarter and getting a tails on the nickel disjoint?
(c) Are the events getting a tails on the dime and a heads on the dime independent?

7.21 In a given town, 40% of the voters are Democrats and 60% are Republican. The president’s budget is supported by 50% of the Democrats and 90% of the Republicans. If a randomly (equally likely) selected voter is found to support the president’s budget, what is the probability that they are a Democrat?

7.22 At Cafe Med on Telegraph Avenue, 60% of the customers prefer regular coffee and 40% prefer decaffeinated.
(a) Among 10 randomly (equally likely) selected customers, what is the probability that at most 8 prefer regular coffee?
(b) Cafe Med is about to close and only has 7 cups of regular left but plenty of decaffeinated. What is the probability that all 10 remaining customers get their preference?

7.23 Assume that 2% of the population of the United States are members of some extremist militia group, (p(M) = 0.02), a fact that some members might not readily admit to an interviewer. We develop a survey that is 95% accurate on positive classification, p(C|M) = 0.95, and 97% accurate on negative classification, $p(C^{c} \mid M^{c}) = 0.97$. Using Bayes’ Law, derive the probability that someone positively classified by the survey as being a militia member really is a militia member. (Hint: Draw a Venn diagram to get p(C) and think about the Theorem of Total Probability).

7.24 Suppose we have two urns containing marbles. The first urn contains 6 red marbles and 4 green marbles, and the second urn contains 9 red marbles and 1 green marble. Take one marble from the first urn (without looking at it) and put it in the second urn. Then take one marble from the second urn (again without looking at it) and put it in the first urn. What is the probability of now drawing a red marble from the first urn?

7.25 Corsi (1981) examined political terrorism, responses to terrorist acts, and the counter-response of the terrorists for 1970 to 1974. For the type of events where a target is seized and held at an unknown site (like kidnapping) he found that 55.6% (n = 35) of the time the government involved capitulated. Given that this happened, 2.9% of the time the terrorists increased their demands, 91.4% of the time there was no change in these demands, and 5.7% of the time contact was lost. Of these three events, the number of times that there was known to be no damage or death was 1, 31, and 1, respectively. Construct a tree diagram that contains the conditional probabilities at each level.

7.26 Suppose there are three possible outcomes for an experiment: A, B, and C. If the odds of A over B are 9:1 and the odds of B over C are 3:2, what are the probabilities of the three events?

8
Random Variables

8.1 Objectives

This chapter describes the means by which we label and treat known and unknown values. Basically there are two types of observable data, and the abstract terminology for yet-to-be observed values should also reflect this distinction. We first talk here about the levels of measurement for observed values where the primary distinction is discrete versus continuous. We will then see that the probability functions used to describe the distribution of such variables preserve this distinction. Many of the topics here lead to the use of statistical analysis in the social sciences.

8.2 Levels of Measurement

It is important to classify data by the precision of measurement. Usually in the social sciences this is an inflexible condition because many times we must take data “as is” from some collecting source. The key distinction is between discrete data, which take on a set of categorical values, and continuous data, which take on values over the real number line (or some bounded subset of it). The difference can be subtle. While discreteness requires countability, it can be infinitely countable, such as the set of positive integers. In contrast, a continuous random variable takes on uncountably infinite values, even if only in some range of the real number line, like [0 : 1], because any interval of the real line, finitely bounded or otherwise, contains an infinite number of rational and irrational numbers. To get a sense of how densely packed these values are, consider any two points on the real number line. It is always possible to find a third point between them. Now consider finding a point that lies between the first point and this new point; another easy task. Clearly we can continue this process infinitely and can therefore never fully count the number of values between any two points on the real number line. (Strictly, this argument shows only that the values are infinitely dense; the rational numbers share this property yet are countable, and it is the irrational numbers that make the interval uncountable.)

It is customary to divide levels of measurement into four types, the first two of which are discrete and the second two of which are either continuous or discrete. Stevens (1946, 1951) introduced the following (now standard) language to describe the four differing measurement precisions for observed data.

Nominal. Nominal data are purely categorical in that there is no logical way to order a set of events. The classic example is religions of the world: Without specifying some additional criteria (such as age, origin, or number of adherents) there is no nonnormative way to rank them. A synonym for nominal is polychotomous, and sometimes just “categorical” is used as well, but this latter term can be confusing in this context because there are two types of categorical data. In addition, dichotomous (yes/no, on/off, etc.) data are also considered nominal, because with two outcomes ordering does not change any interpretive value. Examples of nominal data include
• male/female
• war/peace
• regions of the U.S.
• political parties
• football jersey numbers
• telephone numbers.

Ordinal. Ordinal data are categorical (discrete) like nominal data, but with the key distinction that they can be ranked (i.e., ordered). While we could treat ordinal data as if they were just nominal (both are discrete), we would be ignoring important qualitative information about how to relate categories. Examples include
• seniority in Congress (it is naive to treat years in office more literally);
• lower/middle/upper socio-economic class;
• Likert scales (agree/no opinion/disagree, and other variants);
• Guttman scale (survey respondents are presented with increasingly hard-to-agree-with statements until they disagree);
• levels of democratization.

Often ordinal data are the result, not of directly measured data, but of artificial indices created by researchers to measure some latent characteristic. For instance, sociologists are sometimes concerned with measuring tolerance within societies. This may be tolerance of different races, cultures, languages, sexual orientations, or professions. Unfortunately it is not possible to measure such underlying attitudes directly either by observation or a single survey query. So it is common to ask a multitude of questions and combine the resulting information into an index: multi-item measures from accumulating scores to create a composite variable. Political scientists do this to a slightly lesser extent when they are concerned with levels of freedom, volatility, political sophistication, ideology, and other multifaceted phenomena.

Interval. The key distinction between interval data and ordinal data is that interval data have equal spacing between ordered values. That is, the difference between 1 and 2 is exactly the difference between 1001 and 1002. In this way the ordering of interval data has a higher level of measurement, allowing more precise comparisons of values. Consider alternatively the idea of measuring partisanship in the U.S. electorate from a survey. It may or may not be the case that the difference between somewhat conservative and conservative is the same as the distance between conservative and extremely conservative. Therefore it would be incorrect, in general, to treat this as interval data.

Interval data can be discrete or continuous, but if they are measured on the real number line, they are obviously continuous. Examples of interval measured data include
• temperature measured in Fahrenheit or Celsius;
• a “feeling thermometer” from 0 to 100 that measures how survey respondents feel about political figures;
• size of legislature (it does not exist when n = 0);
• time in years (0 AD is a construct).

Ratio. Ratio measurement is exactly like interval measurement except that the point at zero is “meaningful.” There is nothing in the definition of interval measure that asserts that zero (if even included in the set of possible values) is really an indicator of a true lack of effect. For example, Fahrenheit and Celsius both have totally arbitrary zero points. Zero Fahrenheit is attributed to the coldest level that the Dutch scientist Daniel Fahrenheit could make a water and salt solution in his lab (he set 100 degrees as his own measured body temperature at that same time). Zero Celsius was established a bit more scientifically by the Swedish astronomer Anders Celsius as the point where water alone freezes (and, as is generally known, 100 degrees Celsius is the point where water boils). While the zero point in both cases has some physical basis, the choice of water and salt is completely arbitrary. Suppose we were to meet developed creatures from Jupiter. It is likely that their similarly constructed scales would be based on ammonia, given the dominant chemical content of the Jovian atmosphere. So what does this limitation to interval measure mean for these two scales? It means that ratios are meaningless: 80 degrees is not twice as hot as 40 degrees (either scale) because the reference point of true zero does not exist. Is there a measure of temperature that is ratio? Fortunately, yes; zero degrees Kelvin is the point at which all molecular motion in matter stops. It is an absolute zero because there can be no lower temperature. Ratio measurement is useful specifically because it does allow direct ratio comparisons.

Ratio measurement, like interval measurement, can be either discrete or continuous. There is also a subtle distinction between interval and ratio measurement that sometimes gets ignored. Previously the example of the size of a legislature was given as interval rather than ratio. Although zero has some meaning in this context, a legislature with zero members does not exist as a legislature and this then voids the utility of the zero point so that it has no practical meaning. Notice that the scale of measurement here is ascending in precision (and actually in desirability as well). This direction is easy to remember with the mnemonic NOIR, like the French word for the color black. Any level of measurement, except nominal, can always be reduced to a lower one simply by ignoring some information. This makes sense at times when the defining characteristic is suspect or perhaps measured poorly.
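Returning to the temperature scales discussed under ratio measurement, a short calculation shows why 80 degrees is not twice as hot as 40 degrees. The sketch below converts both readings to the Kelvin scale, where ratios are meaningful; the conversion functions use the standard formulas, and their names are illustrative.

```python
def fahrenheit_to_kelvin(f):
    """Convert degrees Fahrenheit to kelvins."""
    return (f - 32) * 5 / 9 + 273.15

def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins."""
    return c + 273.15

# The naive ratio on either interval scale suggests "twice as hot".
naive_ratio = 80 / 40

# Ratios computed on the Kelvin (ratio) scale are far smaller.
ratio_fahrenheit = fahrenheit_to_kelvin(80) / fahrenheit_to_kelvin(40)
ratio_celsius = celsius_to_kelvin(80) / celsius_to_kelvin(40)

print(naive_ratio)                  # 2.0
print(round(ratio_fahrenheit, 2))   # about 1.08
print(round(ratio_celsius, 2))      # about 1.13
```

The two interval scales give the same naive ratio of 2 but different ratios once a true zero is used, which is exactly the sense in which ratios on interval scales are meaningless.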

8.3 Distribution Functions

Distribution functions are central in statistical and probabilistic analyses. They provide a description of how we believe that some data-generating process is operating. Since all models are simplifications of reality, probability statements are really just rough simplifications of the way things actually work. Nobody believes that events like wars, marriages, or suicides occur for underlying mathematical reasons. Yet, we can learn a lot about natural, social, and political phenomena by fitting a parsimonious description based on probability statements.

What do we mean by the word random here? We will shortly review a formal and rigorous definition, but it helps to first think intuitively about the meaning. Everyone is accustomed to the idea that some events are more likely to occur than others. We are more likely to eat lunch today than to win the lottery; it is more likely to rain on a given Glasgow winter day than to be sunny; the stock market is more likely to rise on good economic news than to fall. The key idea here is the expression of relative difference between the likelihood of events. Probability formalizes this notion by standardizing such comparisons to exist between zero and one inclusive, where zero means the event will absolutely not occur and one means that the event will certainly occur.† Every other assignment of probability represents some measure of uncertainty and is between these two extreme values, where higher values imply a greater likelihood of occurrence. So probability is no more than a conventional standard for comparisons that we all readily make.

† There is actually a subtlety here. Impossible events have probability zero and exceedingly, exceedingly unlikely events also have probability zero for various reasons. The same logic exists for probability one events. For our purposes these distinctions are not important, however.

Example 8.1: Measuring Ideology in the American Electorate. As a simple example, consider a question from the 2002 American National Election Study that asks respondents to identify their ideology on a seven-point scale that covers extremely liberal, liberal, slightly liberal, moderate, slightly conservative, conservative, and extremely conservative. A total of 1245 in the survey placed themselves on this scale (or a similar one that was merged), and we will assume for the moment that it can be treated as an interval measure. Figure 8.1 shows a histogram of the ideology placements in the first panel. This histogram clearly demonstrates the multimodality of the ideology placements with three modes at liberal, moderate, and conservative. The second panel of Figure 8.1 is a “smoothed” version of the histogram, called a density plot. The y-axis is now rescaled because the area under this curve is normalized to integrate to one. The point of this density plot is to estimate an underlying probability structure that supposedly governs the placement of ideology. The key point is that we do not really believe that a mathematical law of some sort determines political ideology, but hopefully by constructing this density plot we have captured an accurate probabilistic description of the underlying structure that determines the observed phenomenon. So a probability function can be taken as a description of the long-run relative frequencies.

[Fig. 8.1. Self-Identified Ideology, ANES 2002. First panel: histogram of frequencies over the categories EL, L, SL, M, SC, C, EC. Second panel: smoothed ideology curve (density).]

There is actually an old simmering controversy behind the interpretation of these probability functions. One group, who are called “frequentists,” believe that probability statements constitute a long-run likelihood of occurrence for specific events. Specifically, they believe that these are objective, permanent statements about the likelihood of certain events relative to the likelihood of other events. The other group, who are usually termed “Bayesians” or “subjectivists,” believe that all probability statements are inherently subjective in the sense that they constitute a certain “degree of belief” on the part of the person making the probability statement. More literally, this last interpretation constitutes the odds with which one would be willing to place a bet with his or her own money. There are strong arguments for both perspectives, but to a great degree this discussion is more philosophical than practical.
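The reading of probability as a long-run relative frequency can be illustrated with a small simulation. The sketch below is an illustration chosen here, not an example from the text: it simulates rolls of a fair die and tracks the share of sixes, which should settle near 1/6 as the number of rolls grows. The seed value and function name are arbitrary assumptions.

```python
import random

random.seed(0)  # fixed seed so that repeated runs give the same output

def relative_frequency(n_rolls):
    """Share of sixes observed in n_rolls simulated rolls of a fair die."""
    sixes = sum(1 for _ in range(n_rolls) if random.randint(1, 6) == 6)
    return sixes / n_rolls

# The relative frequency fluctuates for small samples and stabilizes for large ones.
for n in (100, 10_000, 100_000):
    print(n, round(relative_frequency(n), 4))

print(round(1 / 6, 4))  # 0.1667, the value the frequencies approach
```

A simulation like this only illustrates the frequentist interpretation; it does not settle the philosophical question of how probability statements should be understood.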

8.3.1 Randomness and Variables

Randomness does not actually mean what many (nonconcerned) people think that it means. Colloquially “random” is synonymous with equally likely outcomes, and that is how we explicitly treated randomness in part of the last chapter. So it may be common to describe the experiment of rolling a single fair die as random because each of the six sides is equally likely. But think about how restrictive this definition would be if that were the only type of uncertainty allowed: All countries are equally likely to go to war, all eligible citizens in a given country are equally likely to vote in the next election, every surveyed household is equally likely to be below the poverty level. What randomness really means is that the outcome of some experiment (broadly defined) is not deterministic, that is, not guaranteed to produce a certain outcome. So, as soon as the probability for some described event slips below one or above zero, it is a random occurrence. Thus if the probability of getting a jackpot on some slot machine is 0.001 for a given pull of the handle, then it is still a random event.

Random variables describe unoccurred events abstractly for pedagogical purposes. That is, it is often convenient to describe the results of an experiment before it has actually occurred. In this way we may state that the outcome of a coin flip is designated as X. So for a fair coin we can now say that the probability that X is going to be a heads on the next flip is 0.5.


Formally, a random variable, often denoted with a capital Latin letter such as X or Y , is a function that maps the sample space on which it is “created” to a subset of the real number line (including possibly the whole real number line itself). So we now have a new sample space that corresponds not to the physical experiment performed but to the possible outcomes of the random variable itself. For example, suppose our experiment is to ﬂip a coin 10 times (n = 10). The random variable X is deﬁned to be the number of heads in these 10 tosses. Therefore, the sample space of a single iteration of the experiment is {H, T }, and the sample space of X is {0, 1, 2, . . . , 10}. Random variables provide the connection between events and probabilities because they give the abstraction necessary for talking about uncertain and unobserved events. Sometimes this is as easy as mapping a probability function to a set of discrete outcomes in the sample space of the random variable. To continue the example, we can calculate (more details below) the probability that X takes on each possible value in the sample space determined by 10 ﬂips of a fair coin:
X       0        1        2        3        4        5
p(X)    0.0010   0.0098   0.0439   0.1172   0.2051   0.2461

X       6        7        8        9        10
p(X)    0.2051   0.1172   0.0439   0.0098   0.0010

Here each possible event for the random variable X, from 0 heads to 10 heads, is paired with a speciﬁc probability value. These probability values necessarily sum to unity because one of the 11 values must occur.
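The probabilities in this table can be reproduced with a few lines of code. The sketch below anticipates the binomial formula, p(X = k) = C(n, k) p^k (1 − p)^(n−k), which is an assumption brought in here rather than something derived at this point in the text; the variable names are illustrative.

```python
from math import comb

n = 10     # number of coin flips
p = 0.5    # probability of heads for a fair coin

# Probability of exactly k heads in n independent flips.
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

for k, prob in pmf.items():
    print(k, f"{prob:.4f}")

print(sum(pmf.values()))  # 1.0
```

The printed values match the table to four decimal places, and the exact probabilities sum to one even though the rounded table entries add to 1.0001.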

8.3.2 Probability Mass Functions

When the state space is discrete, we can assign probability values to each single event, even if the state space is countably infinite (discrete with an infinite number of distinct outcomes). So, for example, in the case of flipping a possibly unfair coin, we can assign a probability to heads, p(H), and therefore a complementary probability to tails, p(T).

The essence of a probability mass function is that it assigns probabilities to each unique event, such that the Kolmogorov Axioms still apply. It is common to abbreviate the expression “probability mass function” with “PMF” as a shorthand. We denote such PMF statements

$$
f(x) = p(X = x),
$$

meaning that the PMF f(x) is a function which assigns a probability for the random variable X equaling the specific numerical outcome x. This notation often confuses people on introduction because of the capital and lower case notation for the same letter. It is important to remember that X is a random variable that can take on multiple discrete values, whereas x denotes a hypothetical single value. Customarily, the more interesting versions of this statement insert actual values for x. So, for instance, the coin-flipping statements above are more accurately given as

$$
p(X = H) = 1 - p(X = T).
$$

Notice that in this setup the coin need not be “fair” in the sense that the probability expression accommodates weighted versions such as p(X = H) = 0.7 and p(X = T) = 0.3.
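One simple way to represent a PMF on a computer is as a lookup table from outcomes to probabilities. The sketch below uses the weighted coin from the text; the dictionary representation and the function name `f` are illustrative assumptions, and the two checks reflect only the nonnegativity and sum-to-one requirements.

```python
# PMF for the weighted coin, using the probabilities given in the text.
pmf = {"H": 0.7, "T": 0.3}

def f(x):
    """f(x) = p(X = x); outcomes outside the sample space get probability zero."""
    return pmf.get(x, 0.0)

# Basic checks: probabilities are nonnegative and sum to one.
nonnegative = all(prob >= 0 for prob in pmf.values())
sums_to_one = abs(sum(pmf.values()) - 1.0) < 1e-12

print(f("H"), f("T"))             # 0.7 0.3
print(nonnegative, sums_to_one)   # True True
```

The sum is compared with one using a small tolerance because decimal probabilities are not always stored exactly in floating-point arithmetic.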

8.3.3 Bernoulli Trials

The coin-flipping example above is actually much more general than it first appears. Suppose we are studying various political or social phenomena such as whether a coup occurs, whether someone votes, cabinet dissolution or continuation, whether a new person joins some social group, if a bill passes or fails, and so on. These can all be modeled as Bernoulli outcomes whereby the occurrence of the event is assigned the value “1,” denoting success, and the nonoccurrence of the event is assigned the value “0,” denoting failure. Success and failure can be an odd choice of words when we are talking about coups or wars or other undesirable events, but this vocabulary is inherited from statistics and is quite well entrenched. The basic premise behind the Bernoulli PMF is that the value one occurs with probability p and the value zero occurs with probability 1 − p. Thus these outcomes form a partition of the sample space and are complementary. If x denotes the occurrence of the event of interest, then

$$
p(x) = p \quad \text{and} \quad p(x^{c}) = 1 - p.
$$

So it is natural to want to estimate p given some observations. There are many ways to do this, but the most direct is to take an average of the events (this process actually has substantial theoretical support, besides being quite simple). So if we ﬂip a coin 10 times and get 7 heads, then a reasonable estimate of p is 0.7.
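The averaging estimate described above can be sketched in a few lines of Python. This is only an illustration: the function name `estimate_p` and the particular ordering of the ten flips are assumptions, since the text gives only the totals (10 flips, 7 heads).

```python
def estimate_p(outcomes):
    """Estimate the Bernoulli probability p as the mean of 0/1 outcomes."""
    if not outcomes:
        raise ValueError("need at least one observed trial")
    return sum(outcomes) / len(outcomes)

# Ten flips with seven heads (1 = head, 0 = tail); the ordering is illustrative.
flips = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
p_hat = estimate_p(flips)
print(p_hat)  # 0.7
```

Because the estimate is just a proportion, the ordering of the flips has no effect on the result.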

8.3.4 Binomial Experiments

The binomial PMF is an extension to the Bernoulli PMF whereby we simultaneously analyze multiple Bernoulli trials. This is historically called an experiment because it was originally applied to controlled tests. The random variable is no longer binary but instead is the sum of the observed binary Bernoulli events and is thus a count: $Y = \sum_{i=1}^{n} X_i$. A complication to this Bernoulli extension is figuring out how to keep track of all of the possible sets of results leading to a particular sum. To make things easy to start with, suppose we are studying three senior legislators who may or may not be retiring at the end of the current term. We believe that there is an underlying shared probability p governing their independent decisions and denote the event of retiring with R. We thus have a number of events E dictated by the three individual values and their ordering, which produce a sum bounded by zero (no retirements) and three (all retirements). These events are given in the first column of Table 8.1 with their respective sums in the second column.
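The bookkeeping for the three legislators can be done mechanically. The sketch below enumerates every ordered outcome, its sum, and its probability under independence; the function name `enumerate_outcomes` and the value p = 0.5 are illustrative assumptions, not values from the text.

```python
from itertools import product

def enumerate_outcomes(p, n=3):
    """List every ordered outcome of n independent Bernoulli trials.

    Each row is (outcome, sum, probability), where 1 marks a retirement (R).
    """
    rows = []
    for outcome in product([1, 0], repeat=n):
        y = sum(outcome)
        prob = p**y * (1 - p) ** (n - y)  # independence: multiply the trials
        rows.append((outcome, y, prob))
    return rows

rows = enumerate_outcomes(p=0.5)  # p = 0.5 is an arbitrary illustrative value
for outcome, y, prob in rows:
    print(outcome, y, prob)

# Count how many orderings lead to each sum y.
ways = {y: sum(1 for _, s, _ in rows if s == y) for y in range(4)}
print(ways)  # {0: 1, 1: 3, 2: 3, 3: 1}
```

The counts 1, 3, 3, 1 are the number of distinct orderings that produce zero, one, two, and three retirements.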

The third column of Table 8.1 gives the probabilities for each of these events, which is simply rewritten in the fourth column to show the structure relating Y and the number of trials, 3. Since the retirement decisions are assumed independent, we can simply multiply the underlying individual probabilities according to the definition given on page 317 in Chapter 7 to get the joint probability of occurrence. If we lump these together by the term Y, it is easy to see that there is one way to get zero retirements with probability $(1-p)^3$, three ways to get one retirement with probability $p(1-p)^2$, three ways to get two retirements with probability $p^2(1-p)$, and one way to get three retirements with probability $p^3$. Recalling that this is really choosing by unordered selection without replacement (page 288), we can note that the number of ways to get each of these events is given by the expression $\binom{3}{y}$.

This is really useful because we can now state the binomial PMF for this particular “experiment” for the sum of retirement events:

$$p(Y = y) = \binom{3}{y} p^{y} (1-p)^{3-y},$$

which has the more general form for n Bernoulli trials:

$$p(Y = y|n, p) = \binom{n}{y} p^{y} (1-p)^{n-y}, \quad y = 0, 1, \ldots, n, \quad 0 \le p \le 1.$$

We can denote this general form or any specific case with the shorthand B(n, p). So in this way we define a general form for the binomial PMF and a mechanism for specifying specific cases, that is, B(10, 0.5), B(100, 0.75), and so on.

There is also a clear pattern to binomial distribution probabilities. The outcome that receives the highest probability is the one that corresponds to n × p (more on this calculation below), and probabilities slope down in both directions from this modal point. A particularly elegant picture results from experiments with so-called “fair” probabilities. Suppose we flip such a fair coin 10 times. What is the full probability outcome for the number of heads? We can obviously make these calculations in exactly the same way that was done in the example above. If such probabilities were then graphed with a barplot, the result would look like Figure 8.2.

Fig. 8.2. Example Binomial Probabilities, p(y|n=10, p=0.5)

y      0      1       2       3       4       5       6       7       8       9       10
p(y)   0.001  0.0098  0.0439  0.1172  0.2051  0.2461  0.2051  0.1172  0.0439  0.0098  0.001

Example 8.2: Binomial Analysis of Bill Passage. Suppose we know that a given legislature has a 0.7 probability of passing routine bills (perhaps from historical analysis). If 10 routine bills are introduced in a given week, what is the probability that:

(i) Exactly 5 bills pass? We can simply plug three values into the binomial PMF for this question:

$$p(Y = 5|n = 10, p = 0.7) = \binom{10}{5} (0.7)^{5} (1 - 0.7)^{10-5} = (252)(0.16807)(0.00243) = 0.10292.$$

(ii) Fewer than three bills pass? The most direct method is to add up the three probabilities associated with zero, one, and two occurrences:

$$p(Y < 3|10, 0.7) = \sum_{i=0}^{2} \binom{10}{i} (0.7)^{i} (1 - 0.7)^{10-i} = 0.0000059 + 0.0001378 + 0.0014467 = 0.00159.$$

(iii) Nine or fewer bills pass? The obvious, but time-consuming way to answer this question is the way the last answer was produced, by summing up all (ten here) applicable individual binomial probabilities. However, recall that because this binomial PMF is a probability function, the sum of the probability of all possible events must be one. So this suggests the following trick:

$$\begin{aligned}
p(Y \le 9|10, 0.7) &= \sum_{i=0}^{9} p(Y = i|10, 0.7) \\
&= \sum_{i=0}^{10} p(Y = i|10, 0.7) - p(Y = 10|10, 0.7) \\
&= 1 - p(Y = 10|10, 0.7) \\
&= 1 - \binom{10}{10} (0.7)^{10} (1 - 0.7)^{10-10} \\
&= 1 - 0.02825 = 0.97175.
\end{aligned}$$
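The three parts of Example 8.2 can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(y, n, p):
    """p(Y = y | n, p) for the binomial PMF."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)

n, p = 10, 0.7
exactly_five = binomial_pmf(5, n, p)
fewer_than_three = sum(binomial_pmf(i, n, p) for i in range(3))
nine_or_fewer = 1 - binomial_pmf(10, n, p)  # complement trick

print(round(exactly_five, 5))      # 0.10292
print(round(fewer_than_three, 5))  # 0.00159
print(round(nine_or_fewer, 5))     # 0.97175
```

The last line uses the complement rather than a ten-term sum, which is the same shortcut the example relies on.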

8.3.5 Poisson Counts

Suppose that instead of counting the number of successes out of a fixed number of trials, we were concerned with the number of events (which can still be considered successes, if one likes) without an upper bound. That is, we might consider the number of wars on a continent, the number of alliances between protest groups, or the number of cases heard by a court. While there may be some practical upper limit imposed by the number of hours in a day, these sorts of events are usually counted as if there is no upper bound because the number of attempts is unknown a priori. Another way of thinking of such count data is in terms of durations: the length of time waiting for some prescribed event. If the probability of the event is proportional to the length of the wait, then the length of wait can be modeled with the Poisson PMF. This discrete distributional form is given by

$$p(y|\lambda) = \frac{e^{-\lambda} \lambda^{y}}{y!}, \quad y \in I^{+}, \quad \lambda \in R^{+}.$$

The assumption of proportionality is usually quite reasonable because over longer periods of time the event has more “opportunities” to occur. Here the single PMF parameter λ is called the intensity parameter and gives the expected number of events. This parametric form is very useful but contains one limiting feature: λ is also assumed to be the dispersion (variance, defined on page 366) of the number of events.

Example 8.3: Poisson Counts of Supreme Court Decisions. Recent Supreme Courts have handed down roughly 8 unanimous decisions per term. If we assume that λ = 8 for the next Court, then what is the probability of observing:

(i) Exactly 6 decisions? Plugging these values into the Poisson PMF gives

$$p(Y = 6|\lambda = 8) = \frac{e^{-8} 8^{6}}{6!} = 0.12214.$$

(ii) Fewer than three decisions? Here we can use a sum of three events:

$$p(Y < 3|\lambda = 8) = \sum_{i=0}^{2} \frac{e^{-8} 8^{y_i}}{y_i!} = 0.00034 + 0.00268 + 0.01073 = 0.01375.$$

(iii) Greater than 2 decisions? The easiest way to get this probability is with the following “trick” using the quantity from above:

$$p(Y > 2|\lambda = 8) = 1 - p(Y < 3|\lambda = 8) = 1 - 0.01375 = 0.98625.$$

The Poisson distribution is quite commonly applied to events in international systems because of the discrete nature of many studied events. The two examples that follow are typical of simple applications. To directly apply the Poisson distribution two assumptions are required:

• Events in different time periods are independent.
• For small time periods, the probability of an event is proportional to the length of time passed in the period so far, and not dependent on the number of previous events in this period.
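Returning briefly to Example 8.3, its three probabilities can be reproduced directly from the Poisson PMF. This is a sketch using only the standard library; the function name `poisson_pmf` is an illustrative choice.

```python
from math import exp, factorial

def poisson_pmf(y, lam):
    """p(y | lambda) = exp(-lambda) * lambda**y / y!"""
    return exp(-lam) * lam**y / factorial(y)

lam = 8
exactly_six = poisson_pmf(6, lam)
fewer_than_three = sum(poisson_pmf(y, lam) for y in range(3))
more_than_two = 1 - fewer_than_three  # complement of the previous quantity

print(round(exactly_six, 5))       # 0.12214
print(round(fewer_than_three, 5))  # 0.01375
print(round(more_than_two, 5))     # 0.98625
```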

These two assumptions are actually not as restrictive as they might appear. The first condition says that rates of occurrence in one time period are not allowed to influence subsequent rates in another. So if we are measuring conflicts, the outset of a widespread war will certainly influence the number of actual battles in the next period, and this rules out the continued use of the same Poisson parameterization as was used prior to the war. The second condition means that time matters in the sense that, for some bounded slice of time, as the waiting time increases, the probability of the event increases. This is intuitive; if we are counting arrivals at a traffic light, then it is reasonable to expect more arrivals as the recording period is extended.

Example 8.4: Modeling Nineteenth-Century European Alliances. McGowan and Rood (1975) looked at the frequency of alliance formation from 1814 to 1914 in Europe between the “Great Powers”: Austria-Hungary, France, Great Britain, Prussia-Germany, and Russia. They found 55 alliances during this period that targeted behavior within Europe between these powers and argued that the observed pattern of occurrence follows the Poisson distribution. The mean number of alliances per year total is 0.545, which they used as their empirical estimate of the Poisson parameter, λ = 0.545. If we use this value in the Poisson PMF, we can compare observed events against predicted events:

Alliances/Year   y = 0   y = 1   y = 2   y ≥ 3
Observed         61      31      6       3
Predicted        58.6    31.9    8.7     1.8
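The predicted row can be recovered by multiplying each Poisson probability by the number of yearly observations. The sketch below assumes that this number is the sum of the observed counts (101 years); the variable names are illustrative.

```python
from math import exp, factorial

def poisson_pmf(y, lam):
    """p(y | lambda) = exp(-lambda) * lambda**y / y!"""
    return exp(-lam) * lam**y / factorial(y)

lam = 0.545
observed = {"0": 61, "1": 31, "2": 6, "3+": 3}
years = sum(observed.values())  # 101 yearly observations (assumed total)

probs = [poisson_pmf(y, lam) for y in range(3)]
probs.append(1 - sum(probs))  # p(Y >= 3) by complement
predicted = [round(years * q, 1) for q in probs]
print(predicted)  # [58.6, 31.9, 8.7, 1.8]
```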

This seems to fit the data reasonably well in terms of prediction. It is important to recall that λ = 0.545 is the intensity parameter for five countries to enter into alliances, so assuming that each country is equally likely, the intensity parameter for an individual country is λ_i = 0.545/5 = 0.109.

Example 8.5: Poisson Process Model of Wars. Houweling and Kuné (1984) looked at wars as discrete events in a paper appropriately titled “Do Outbreaks of War Follow a Poisson-Process?” They compared 224 events of international and civil wars from 1816 to 1980 to that predicted by estimating the Poisson intensity parameter with the empirical mean: λ = 1.35758. Evidence from Figure 8.3 indicates that the Poisson assumption fits the data quite nicely (although the authors quibbled about the level of statistical significance).
Fig. 8.3. Poisson Probabilities of War [predicted and observed counts plotted against the number of wars per year, 1816−1980]

Interestingly, the authors found that the Poisson assumption ﬁts less well when the wars were disaggregated by region. The events in the Western Hemisphere continue to ﬁt, while those in Europe, the Middle East, and Asia deviate from the expected pattern. They attribute this latter effect to not meeting the second condition above.

8.3.6 The Cumulative Distribution Function: Discrete Version

If X is a discrete random variable, then we can define the sum of the probability mass to the left of some point X = x: the mass associated with values less than or equal to x. Thus the function

$$F(x) = p(X \le x)$$

defines the cumulative distribution function (CDF) for the random variable X. A couple of points about notation are worth mentioning here. First, note that the function uses a capital “F” rather than the lower case notation given for the PMF. Sometimes the CDF notation is given with a random variable subscript, F_X(x), to remind us that this function corresponds to the random variable X. If the values that X can take on are indexed by order: x_1 < x_2 < · · · < x_n, then the CDF can be calculated with a sum for the chosen point x_j:

$$F(x_j) = \sum_{i=1}^{j} p(x_i).$$

That is, F(x_j) is the sum of the probability mass for events less than or equal to x_j. Using this definition of the random variable, it follows that F(x < x_1) = 0 and F(x ≥ x_n) = 1.

Therefore, CDF values are bounded by [0 :1] under all circumstances, even if the random variable is not indexed in this convenient fashion. In fact, we can now state technically the three defining properties of a CDF:

• [Cumulative Distribution Function Definition.] F(x) is a CDF for the random variable X iff it has the following properties:
– bounds: $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$,
– nondecreasing: $F(x_i) \le F(x_j)$ for $x_i < x_j$,
– right-continuous: $\lim_{x \downarrow x_i} F(x) = F(x_i)$ for all $x_i$.

The idea of a right-continuous function is best understood with an illustration. Suppose we have a binomial experiment with n = 3 trials and p = 0.5. Therefore the sample space is S = {0, 1, 2, 3}, and the probabilities associated with each event are [0.125, 0.375, 0.375, 0.125]. The graph of F(x) is given in Figure 8.4, where the discontinuities reflect the discrete nature of a binomial random variable. The solid circles on the left-hand side of each interval emphasize that this value at the integer belongs to that CDF level, and the lack of such a circle on the right-hand side denotes otherwise. The function is right-continuous because for each value of x_i (i = 0, 1, 2, 3) the limit of the function reaches F(x_i) moving from the right. The arrows pointing left and right at 0 and 1, respectively, are just a reminder that the CDF is defined towards negative and positive infinity at these values. Note also that while the values are cumulative, the jumps between each level correspond to the PMF values f(x_i), i = 0, 1, 2, 3.

Fig. 8.4. Binomial CDF Probabilities, n = 3, p = 0.5 [step function F(x) over x = 0, 1, 2, 3, with levels 0.125, 0.500, 0.875, and 1.000]

It is important to know that a CDF fully deﬁnes a probability function, as does a PMF. Since we can readily switch between the two by noting the step sizes (CDF→PMF) or by sequentially summing (PMF→CDF), then the one we use is completely a matter of convenience.
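Both directions of this switch can be illustrated with the binomial case of n = 3 and p = 0.5. The sketch below is a minimal illustration; the helper name `F` mirrors the CDF notation and the test points passed to it are arbitrary.

```python
from itertools import accumulate

support = [0, 1, 2, 3]
pmf = [0.125, 0.375, 0.375, 0.125]  # binomial with n = 3, p = 0.5

# PMF -> CDF: sequentially sum the probability mass.
cdf = list(accumulate(pmf))
print(cdf)  # [0.125, 0.5, 0.875, 1.0]

# CDF -> PMF: read off the step sizes between consecutive CDF levels.
steps = [cdf[0]] + [hi - lo for lo, hi in zip(cdf, cdf[1:])]
print(steps)  # [0.125, 0.375, 0.375, 0.125]

def F(x):
    """Evaluate the step-function CDF at any real x."""
    return sum(p for value, p in zip(support, pmf) if value <= x)

print(F(-1), F(1.5), F(10))  # 0 0.5 1.0
```

Evaluating `F` between integers returns the level of the step to the left, which is the right-continuity property in action.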

8.3.7 Probability Density Functions

So far the random variables have only taken on discrete values. Clearly it would be a very limiting restriction if random variables that are defined over some interval of the real number line (or even the entire real number line) were excluded. Unfortunately, the interpretation of probability functions for continuous random variables is a bit more complicated. As an example, consider a spinner sitting flat on a table. We can measure the direction of the spinner relative to some reference point in radians, which vary from 0 to 2π (Chapter 2). How many outcomes are possible? The answer is infinity because the spinner can theoretically take on any value on the real number line in [0 : 2π]. In reality, the number of outcomes is limited by our measuring instrument, which is by definition discrete. Nonetheless, it is important to treat continuous random variables in an appropriate manner. For continuous random variables we replace the probability mass function with the probability density function (PDF). Like the PMF, the PDF assigns probabilities to events in the sample space, but because there is an infinite number of alternatives, we cannot say p(X = x) and so just use f(x) to denote the function value at x. The problem lies in questions such as, if we survey a large population, what is the probability that the average income is exactly $65,123.97? Such an event is sufficiently rare that its probability is essentially zero. It goes to zero as a measurement moves toward being truly continuous (money in dollars and cents is still discrete, although granular enough to be treated as continuous in most circumstances). This seems ultimately frustrating, but the solution lies in the ability to replace probabilities of specific events with probabilities of ranges of events. So instead with our survey example we may ask questions such as, what is the probability that the average income amongst respondents is greater than $65,000?

Fig. 8.5. Exponential PDF Forms [two panels of f(x) against x: β = 0.1, 0.5, 1.0 and β = 5, 10, 50]

8.3.8 Exponential and Gamma PDFs

The exponential PDF is a very general and useful functional form that is often used to model durations (how long “things last”). It is given by

$$f(x|\beta) = \frac{1}{\beta} \exp[-x/\beta], \quad 0 \le x < \infty, \quad 0 < \beta,$$

where, similar to the Poisson PMF, the function parameter (β here) is the mean or expected duration. One reason for the extensive use of this PDF is that it can be used to model a wide range of forms. Figure 8.5 gives six different parameterizations in two frames. Note the broad range of spread of the distribution evidenced by the different axes in the two frames. For this reason β is called a scale parameter: It affects the scale (extent) of the main density region. Although we have praised the exponential distribution for being flexible, it is still a special case of the even more flexible gamma PDF. The gamma distribution adds a shape parameter that changes the “peakedness” of the distribution: how sharply the density falls from a modal value. The gamma PDF is given by

$$f(x|\alpha, \beta) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} x^{\alpha-1} \exp[-x/\beta], \quad 0 \le x < \infty, \quad 0 < \alpha, \beta,$$

where α is the new shape parameter, and the mean is now αβ. Note the use of the gamma function (hence the name of this PDF). Figure 8.6 shows different forms based on varying the α and β parameters where the y-axis is fixed across the two frames to show a contrast in effects.

Fig. 8.6. Gamma PDF Forms [two panels of f(x) against x, one with β = 1 and one with β = 10, each showing α = 1, 5, 10]

An important special case of the gamma PDF is the χ² distribution, which is used in many statistical tests, including the analysis of tables. The χ² distribution is a gamma where α = df/2 and β = 2, and df is a positive integer value called the degrees of freedom.

Example 8.6: Characterizing Income Distributions. The gamma distribution is particularly well suited to describing data that have a mode near zero and a long right (positive) skew. It turns out that income data fit this description quite closely. Pareto (1897) first noticed that income in societies, no matter what kind of society, follows this pattern, and this effect is sometimes called Pareto’s Law. Subsequent studies showed that the gamma distribution could be easily tailored to describe a range of income distributions. Salem and Mount (1974) looked at family income in the United States from 1960 to 1969 using survey data from the Current Population Report Series (CPS) published by the Census Bureau and fit gamma distributions to categories. Figure 8.7 shows histograms for 1960 and 1969 where the gamma distributions are fit according to f_1960(income) = G(2.06, 3.2418) and f_1969(income) = G(2.43, 4.3454) (note: Salem and Mount’s table contains a typo for β_1969, and this is clearly the correct value given here, as evidenced from their graph and the associated fit). The unequal size categories are used by the authors to ensure equal numbers of sample values in each bin. It is clear from these fits that the gamma distribution can approximately represent the types of empirical forms that income data takes.
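The exponential and gamma densities are easy to evaluate directly, which also makes the special-case relationship between them concrete. The sketch below uses the scale parameterization given in this section; the function names are illustrative, and the means are reported in whatever income units Salem and Mount used, which this section does not state.

```python
from math import exp, gamma

def exponential_pdf(x, beta):
    """f(x | beta) = (1 / beta) * exp(-x / beta) for x >= 0."""
    return exp(-x / beta) / beta

def gamma_pdf(x, alpha, beta):
    """f(x | alpha, beta) with shape alpha and scale beta, for x >= 0."""
    return x ** (alpha - 1) * exp(-x / beta) / (gamma(alpha) * beta**alpha)

# With alpha = 1 the gamma PDF reduces to the exponential PDF.
print(abs(gamma_pdf(2.0, 1, 5) - exponential_pdf(2.0, 5)) < 1e-12)  # True

# Means (alpha * beta) implied by the two fits quoted in Example 8.6.
fits = {1960: (2.06, 3.2418), 1969: (2.43, 4.3454)}
means = {year: alpha * beta for year, (alpha, beta) in fits.items()}
for year, value in means.items():
    print(year, round(value, 3))
```

Setting α = df/2 and β = 2 in `gamma_pdf` gives the χ² density described above.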

8.3.9 Normal PDF

By far the most famous probability distribution is the normal PDF, sometimes also called the Gaussian PDF in honor of its “discoverer,” the German mathematician Carl Friedrich Gauss. In fact, until replacement with the Euro currency on January 1, 2002, the German 10 Mark note showed a plot of the normal distribution and gave the mathematical form

$$f(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right], \quad -\infty < x, \mu < \infty, \quad 0 < \sigma^2,$$

where µ is the mean parameter and σ² is the dispersion (variance) parameter. These two terms completely define the shape of the particular normal form where µ moves the modal position along the x-axis, and σ² makes the shape more spread out as it increases. Consequently, the normal distribution is a member of the location-scale family of distributions because µ moves only the location (and not anything else) and σ² changes only the scale (and not the location of the center or modal point). Figure 8.8 shows the effect of varying these two parameters individually in two panels.
Fig. 8.8. Normal PDF Forms [two panels of f(x) against x: σ² = 1 with µ = 0, −3, 3; and µ = 0 with σ² = 1, 5, 10]

The reference figure in both panels of Figure 8.8 is a normal distribution with µ = 0 and σ² = 1. This is called a standard normal and is of great practical as well as theoretical significance. The PDF for the standard normal simplifies to

$$f(x) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}x^2\right], \quad -\infty < x < \infty.$$

The primary reason that this is an important form is that, due to the location-scale characteristic, any other normal distribution can be transformed to a standard normal and then back again to its original form. As a quick example, suppose x ∼ N(µ, σ²); then y = (x − µ)/σ ∼ N(0, 1), where σ is the standard deviation (the square root of the variance). We can then return to x by substituting x = yσ + µ. Practically, what this means is that textbooks need only include one normal table (the standard normal) for calculating tail values (i.e., integrals extending from some point out to infinity), because all other normal forms can be transformed to the standard normal in this way.

One additional note relates to the normal distribution. There are quite a few other common distributions that produce unimodal symmetric forms that appear similar to the normal. Some of these, however, have quite different mathematical properties and thus should not be confused with the normal. For this reason it is not only lazy terminology, but it is also very confusing to refer to a distribution as “bell-shaped.”

Example 8.7: Levels of Women Serving in U.S. State Legislatures. Much has been made in American politics about the role of women in high level government positions (particularly due to “the year of the woman” in 1992). The first panel of Figure 8.9 shows a histogram of the percent of women in legislatures for the 50 states with a normal distribution (µ = 21, σ = 8) superimposed (source: Center for American Women and Politics). The obvious question is whether the data can be considered normally distributed. The normal curve appears to match well the distribution given in the histogram. The problem with relying on this analysis is that the shape of a histogram is greatly affected by the number of bins selected. Consequently, the second panel of Figure 8.9 is a “qqplot” that plots the data against standard normal quantiles (a set of ordered values from the standard normal PDF of length equal to the evaluated vector). The closer the data points are to the line, the closer they are to being normally distributed. We can see here that the fit is quite close with just a little bit of deviation in the tails. Asserting that these data are actually normal is useful in that it allows us to describe typical or atypical cases more precisely, and perhaps to make predictive claims about future legislatures.
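The standardization step and the use of a single standard normal table can be sketched as follows. The code writes the standard normal CDF with the error function from the standard library; the value x = 29 is a hypothetical percentage chosen to sit one standard deviation above the mean of the curve in Example 8.7, and the function names are illustrative.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Normal density with mean mu and variance sigma2."""
    return exp(-((x - mu) ** 2) / (2 * sigma2)) / sqrt(2 * pi * sigma2)

def standard_normal_cdf(z):
    """P(Z <= z) for Z ~ N(0, 1), written with the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def standardize(x, mu, sigma):
    """Map x ~ N(mu, sigma**2) to the standard normal scale."""
    return (x - mu) / sigma

mu, sigma = 21.0, 8.0  # the curve superimposed in Example 8.7
x = 29.0               # hypothetical value, one standard deviation above the mean
z = standardize(x, mu, sigma)
print(z)                                     # 1.0
print(z * sigma + mu)                        # 29.0
print(round(normal_pdf(0.0), 4))             # 0.3989
print(round(1 - standard_normal_cdf(z), 4))  # 0.1587
```

The last line is the upper tail value beyond one standard deviation, the kind of quantity a standard normal table provides.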

8.3.10 The Cumulative Distribution Function: Continuous Version

If X is a continuous random variable, then we can also define the sum of the probability mass to the left of some point X = x: the density associated with all values less than x. Thus the function

$$F(x) = p(X \le x) = \int_{-\infty}^{x} f(x)\,dx$$

defines the cumulative distribution function (CDF) for the continuous random variable X. Even though this CDF is given with an integral rather than a sum, it retains the three key defining properties (see page 348). The difference is that instead of being a step function (as shown in Figure 8.4), it is a smooth curve monotonically nondecreasing from zero to one.

Example 8.8: The Standard Normal CDF: Probit Analysis. The CDF of the standard normal is often abbreviated Φ(X) for N(X ≤ x|µ = 0, σ² = 1) (the associated PDF notation is φ(X)). One application that occurs in empirical models is the idea that while people may make dichotomous choices (vote/not vote, purchase/not purchase, etc.), the underlying mechanism of decision is really a smooth, continuous preference or utility function that describes more subtle thinking. If one event (usually the positive/action choice) is labeled as “1” and the opposite event as “0,” and if there is some interval measured variable X that affects the choice, then Φ(X) = p(X = 1) is called the probit model. In the basic formulation higher levels of X are assumed to push the subject toward the “1” decision, and lower levels of X are assumed to push the subject toward the “0” decision (although the opposite effect can easily be modeled as well).
Fig. 8.10. Probit Models for Partisan Vote Choice [probability of voting for the Republican candidate plotted against the ideology measurement, from liberal through moderate to conservative, with separate curves for gun ownership and no gun ownership]

To give a concrete example, consider the dichotomous choice outcome of voting for a Republican congressional candidate against an interval measured explanatory variable for political ideology. One certainly would not be surprised to observe that more conservative individuals tend to vote Republican and more liberal individuals tend not to vote Republican. We also obtain a second variable indicating whether the respondent owns a gun. A simple probit model is specified for these data with no directly indicated interaction term:

p(Y_i = 1) = Φ(IDEOLOGY_i + GUN_i).

Here IDEOLOGY_i is the political ideology value for individual i, GUN_i is a dichotomous variable equaling one for gun ownership and zero otherwise (it is common to weight these two values in such models, but we can skip it here without losing the general point). This model is depicted in Figure 8.10 where gun owners and nongun owners are separated. Figure 8.10 shows that gun ownership shifts the curve affecting the probability of voting for the Republican candidate by making it more likely at more liberal levels of ideology. Also, for very liberal and very conservative respondents, gun ownership does not really affect the probability of voting for the Republican. Yet for respondents without a strong ideological orientation, gun ownership matters considerably: a difference of about 50% at the center.
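The qualitative pattern (a small gun-ownership effect at the ideological extremes and a large one near the center) can be seen by evaluating the unweighted probit directly. This is only a sketch: the ideology scale below (negative for liberal, zero for moderate, positive for conservative) is a hypothetical assumption, and because the scale and weights behind Figure 8.10 are not given, the printed gaps should not be expected to match the figure.

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def p_vote_republican(ideology, gun):
    """Unweighted probit from the text: Phi(IDEOLOGY_i + GUN_i)."""
    return Phi(ideology + gun)

# Hypothetical ideology scale: negative = liberal, 0 = moderate, positive = conservative.
gaps = {}
for ideology in (-3, -1, 0, 1, 3):
    without_gun = p_vote_republican(ideology, gun=0)
    with_gun = p_vote_republican(ideology, gun=1)
    gaps[ideology] = with_gun - without_gun
    print(ideology, round(without_gun, 3), round(with_gun, 3))
```

On this assumed scale the gap between the two curves is largest near the center and nearly vanishes at both ends, which is the shape of effect the example describes.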

8.3.11 The Uniform Distributions

There is an interesting distributional form that accommodates both discrete and continuous assumptions. The uniform distribution is a perfectly flat form that can be specified in either manner:

k-category discrete case (PMF): $p(Y = y|k) = 1/k$ for each of the k possible outcomes y, and 0 otherwise;
continuous case (PDF): $f(y|a, b) = 1/(b - a)$ for $a \le y \le b$, and 0 otherwise.

The discrete case specifies k outcomes (hence the conditioning on k in p(Y = y|k)) that can be given any range desired (obviously greater ranges make 1/k smaller for fixed k), and the continuous case just gives the bounds (a and b), which are often zero and one. So the point is that each outcome has equal individual probability (PMF) or equal density (PDF). This distribution is sometimes used to reflect great uncertainty about outcomes (although it is definitely saying something specific about the probability of events). The continuous case with a = 0 and b = 1 is particularly useful in modeling probabilities.

Example 8.9: Entropy and the Uniform Distribution. Suppose we wanted to identify a particular voter by serial information on this person’s characteristics. We are allowed to ask a consecutive set of yes/no questions (i.e., like the common guessing game). As we get answers to our series of questions we gradually converge (hopefully, depending on our skill) on the desired voter. Our first question is, does the voter reside in California? Since about 13% of voters in the United States reside in California, a yes answer gives us different information than a no answer. Restated, a yes answer reduces our uncertainty more than a no answer because a yes answer eliminates 87% of the choices whereas a no answer eliminates 13%. If P_i is the probability of the ith event (residing in California), then the improvement in information as defined by Shannon (1948) is

$$I_{P_i} = \log_2\left[\frac{1}{P_i}\right] = -\log_2 P_i.$$

The probability is placed in the denominator here because the smaller the probability, the greater the investigative information supplied by a yes answer. The log function is required to obtain some desired properties (discussed below) and is justified by various limit theorems. The logarithm is base-2 because there are only two possible answers to our question (yes and no), making the units of information bits. In this example, H_i = −log₂(0.13) = 2.943416 bits, whereas if we had asked, does the voter live in the state of Arkansas? then an affirmative reply would have increased our information by H_i = −log₂(0.02) = 5.643856 bits, or about twice as much. However, there is a much smaller probability that we would have gotten an affirmative reply had the question been asked about Arkansas. What Slater (1939) found, and Shannon (1948) later refined, was the idea that the “value” of the question was the information returned by a positive response times the probability of a positive response. So if the value of the ith binary-response question is

$$H_i = f_i \log_2\left[\frac{1}{f_i}\right] = -f_i \log_2 f_i,$$

then the value of a series of n of these questions is

$$\sum_{i=1}^{n} H_i = k \sum_{i=1}^{n} f_i \log_2\left[\frac{1}{f_i}\right] = -k \sum_{i=1}^{n} f_i \log_2 f_i,$$

where f_i is the frequency distribution of the ith yes answer and k is an arbitrary scaling factor that determines choice of units. This is called the Shannon entropy or information entropy form. The arbitrary scaling factor here makes the choice of base in the logarithm unimportant because we can change this base by manipulating the constant. For instance, if this form were expressed in terms of the natural log, but log₂ was more appropriate for the application (such as above), then setting k = 1/ln 2 converts the entropy form to base 2.

We can see that the total improvement in information is the additive value of the series of individual information improvements. So in our simple example we might ask a series of questions narrowing down on the individual of interest. Is the voter in California? Is the voter registered as a Democrat? Does the voter reside in an urban area? Is the voter female? The total information supplied by this vector of yes/no responses is the total information improvement in units of bits because the response space is binary. It is important to remember that the information obtained is defined only with regard to a well-defined question having finite, enumerated responses.

The uniform prior distribution as applied provides the greatest entropy because no single event is more likely to occur than any others:

$$H = -\sum_{i=1}^{n} \frac{1}{n} \ln\left[\frac{1}{n}\right] = \ln(n),$$

and entropy here increases logarithmically with the number of these equally likely alternatives. Thus the uniform distribution of events is said to provide the minimum information possible with which to decode the message. This application of the uniform distribution does not imply that this is a “no information” assumption because equally likely outcomes are certainly a type of information. A great deal of controversy and discussion has focused around the erroneous treatment of the uniform distribution as a zero-based information source. Conversely, if there is certainty about the result, then a degenerate distribution describes the m_i, and the message does not change our information level:

$$H = -\sum_{i=1}^{n-1} (0) - \log(1) = 0.$$
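The quantities in Example 8.9 are straightforward to compute. The sketch below reproduces the two information values in bits and then contrasts the entropy of a uniform distribution with that of a degenerate one; the function names and the choice of n = 8 alternatives are illustrative assumptions, and zero-probability events are skipped under the usual convention that they contribute nothing to the sum.

```python
from math import log, log2

def information_bits(p):
    """Information from a yes answer that has probability p, in bits."""
    return -log2(p)

def entropy(freqs, k=1.0):
    """Shannon entropy -k * sum(f * ln f), skipping zero-probability events."""
    return -k * sum(f * log(f) for f in freqs if f > 0)

print(round(information_bits(0.13), 6))  # 2.943416
print(round(information_bits(0.02), 6))  # 5.643856

n = 8  # illustrative number of equally likely alternatives
uniform = [1 / n] * n
degenerate = [0.0] * (n - 1) + [1.0]
print(abs(entropy(uniform) - log(n)) < 1e-12)  # True: H = ln(n)
print(entropy(uniform, k=1 / log(2)))          # the same entropy in bits, about 3
print(entropy(degenerate) == 0)                # True
```

Passing k = 1/ln 2 is the base change described above: it rescales the natural-log entropy into bits.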

8.4 Measures of Central Tendency: Mean, Median, and Mode

The first and most useful step in summarizing observed data values is determining their central tendency: a measure of where the “middle” of the data resides on some scale. Interestingly, there is more than one definition of what constitutes the center of the distribution of the data, the so-called average. The most obvious and common choice for the average is the mean. For n data points x_1, x_2, . . . , x_n, the mean is

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,$$

where the bar notation is universal for denoting a mean average. The mean average is commonly called just the “average,” although this is a poor convention in the social sciences because we use other averages as well. The median average has a different characteristic; it is the point such that as many cases are greater as are less: For n data points x_1, x_2, . . . , x_n sorted in ascending order, the median is X_i such that i = n/2 (even n) or i = (n + 1)/2 (odd n). This definition suits an odd size to the dataset better than an even size, but in the latter case we just split the difference and define a median point that is halfway between the two central values. More formally, the median is defined as

$$M_x = X_i : \int_{-\infty}^{x_i} f_x(X)\,dx = \int_{x_i}^{\infty} f_x(X)\,dx = \frac{1}{2}.$$

Here f_x(X) denotes the empirical distribution of the data, that is, the distribution observed rather than that obtained from some underlying mathematical function generating it (see Chapter 7). The mode average has a totally different flavor. The mode is the most frequently observed value. Since all observed data are countable, and therefore discrete, this definition is workable for data that are actually continuously measured. This occurs because even truly continuous data generation processes, which should be treated as such, are measured or observed with finite instruments. The mode is formally given by the following:

$$m_x = X_i : n(X_i) > n(X_j) \ \forall j \ne i,$$

where the notation “n()” means “number of” values equal to the X stipulated in the cardinality sense (page 294).
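The three averages can be written out in a few lines. The sketch below follows the definitions above, including the split-the-difference rule for an even number of cases and the strict inequality in the mode definition; the data values are hypothetical and the function names are illustrative.

```python
from collections import Counter

def mean(xs):
    """Arithmetic mean: the sum of the values divided by their number."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the sorted data; average the two central values if n is even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

def mode(xs):
    """Most frequently observed value, or None when the top count is tied."""
    counts = Counter(xs).most_common(2)
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no single value satisfies n(X_i) > n(X_j) for all j != i
    return counts[0][0]

data = [2, 3, 3, 5, 7, 10]  # hypothetical values, not from the text
print(mean(data))    # 5.0
print(median(data))  # 4.0
print(mode(data))    # 3
```

Returning `None` for a tie is one possible design choice; it reflects the strict inequality in the formal definition rather than picking one of the tied values arbitrarily.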

The mean values by racial group are \(\bar{X}_{\text{Black}} = 19.43\), \(\bar{X}_{\text{Hispanic}} = 5.88\), \(\bar{X}_{\text{Asian}} = 4.00\), and \(\bar{X}_{\text{White}} = 70.72\). The median values differ somewhat: \(M_{\text{Black}} = 16.2\), \(M_{\text{Hispanic}} = 5.2\), \(M_{\text{Asian}} = 3.4\), and \(M_{\text{White}} = 73.1\). Cases where the mean and median differ noticeably are where the data are skewed (asymmetric) with the longer “tail” in the direction of the mean. For example, the white group is negatively skewed (also called left-skewed) because the mean is noticeably less than the median. These data do not have a modal value in unrounded form, but we can look at modality through a stem and leaf plot, which groups data values by leading digits and looks like a histogram that is turned on its side. Unlike a histogram, though, the bar “heights” contain information in the form of the lower digit values. For these data we have the following four stem and leaf plots (with rounding):

Black:
The decimal point is 1 digit to the right of the |

   0|6
   1|111123455678
   2|2244
   2|8
   3|4
   3|6
   4|
   4|8

Hispanic:
The decimal point is at the |

   2|49
   3|4
   4|233677
   5|0269
   6|0267
   7|
   8|4
   9|5
  10|6
  11|
  12|2

Asian:
The decimal point is at the |

   1|6
   2|47899
   3|12334778
   4|29
   5|124
   6|7
   7|
   8|
   9|8

White:
The decimal point is 1 digit to the right of the |

   3|9
   4|
   5|66
   6|356799
   7|3345577
   8|00119

Due to the level of measurement and the relatively small number of agency cases (21), we do not have an exact modal value. Nonetheless, the stem and leaf plot shows that values tend to clump along a single modal region in each case. For instance, if we were to round the Asian values to integers (although this would lose information), then the mode would clearly be 2%.

One way to consider the utility of the three different averages is to evaluate their resistance to large outliers. The breakdown bound is the proportion of data values that can become unbounded (go to plus or minus infinity) before the statistic of interest becomes unbounded itself. The mean has a breakdown bound of 0 because even one value of infinity will take the sum to infinity. The median is much more resistant because almost half the values on either side can become unbounded before the median itself goes to infinity. In fact, it is customary to give the median a breakdown bound of 0.5 because as the data size increases, the breakdown bound approaches this value. The mode is much more difficult to analyze in this manner as it depends on the relative placement of values. It is possible for a high proportion of the values to become unbounded provided a higher proportion of the data is concentrated at some other point. If these points are more spread out, however, the infinity point may become the mode and thus the breakdown bound lowers. Due to this uncertainty, the mode cannot be given a definitive breakdown bound value.
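The breakdown idea can be made concrete with a small experiment. The sketch below is an illustration only: the five data values are hypothetical, and a very large finite number (1e12) stands in for a value that "goes to infinity." The helper name contaminate is ours.

```python
import statistics

values = [3.0, 4.0, 5.0, 6.0, 7.0]  # hypothetical data
BIG = 1e12  # stand-in for a value drifting off toward infinity (assumption)


def contaminate(xs, m, big=BIG):
    """Return a copy of xs in which the m largest values are replaced by `big`."""
    s = sorted(xs)
    return s[: len(s) - m] + [big] * m


for m in range(4):
    bad = contaminate(values, m)
    # The mean explodes as soon as m = 1; the median holds until m exceeds half of n.
    print(m, statistics.mean(bad), statistics.median(bad))
```

With five points the median survives two corrupted values and fails at three, consistent with a breakdown bound that approaches 0.5 as n grows, while the mean is ruined by a single corrupted value.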

8.5 Measures of Dispersion: Variance, Standard Deviation, and MAD

The second most important and common data summary technique is calculating a measure of spread: how dispersed are the data around a central position? Often a measure of centrality and a measure of spread are all that are necessary to give researchers a very good, general view of the behavior of the data. This is particularly true if we know something else, such as that the data are unimodal and symmetric. Even when we do not have such complementary information, it is tremendously useful to know how widely spread the data points are around some center.


The most useful and common measure of dispersion is the variance. For n data points \(x_1, x_2, \ldots, x_n\), the variance is given by

\[ \mathrm{Var}(X) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2. \]

The preceding fraction, \(\frac{1}{n-1}\), is slightly surprising given the more intuitive \(\frac{1}{n}\) for the mean. It turns out that without the −1 component the statistic is biased: not quite right on average for the true underlying population quantity. A second closely related quantity is the standard deviation, which is simply the square root of the variance:

\[ \mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}. \]

Since the variance and the standard deviation give the same information, the choice of which one to use is often a matter of taste. There are times, however, when particular theoretical discussions need to use one form over the other.

A very different measure of dispersion is the median absolute deviation (MAD). This is given by the form

\[ \mathrm{MAD}(X) = \mathrm{median}(|x_i - \mathrm{median}(x)|), \]

for i = 1, 2, . . . , n. That is, the MAD is the median of the absolute deviations from the data median. Why is this useful? Recall our discussion of resistance. The variance (and therefore the standard deviation) is very sensitive to large outliers, more so even than the mean due to the squaring. Conversely, the MAD obviously uses medians, which, as noted, are far more resistant to atypical values. Unfortunately, there are some differences in the way the MAD is specified. Sometimes a mean is used instead of the innermost median, for instance, and sometimes there is a constant multiplier to give asymptotic properties. This is irritating because it means that authors need to say which version they are providing.

Example 8.11: Employment by Race in Federal Agencies, Continued. Returning to the values in Table 8.2, we can calculate the three described measures of dispersion for the racial groups. It is important to remember that none of these three measures is necessarily the “correct” view of dispersion in the absolute sense, but rather that they give different views of it.

Table 8.3. Measures of Dispersion, Race in Agencies

                      Black    Hispanic    Asian     White
variance             107.20        6.18     3.16    122.63
standard deviation    10.35        2.49     1.78     11.07
MAD                    5.60        1.00     0.60      6.60
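The three dispersion measures of Section 8.5 are short enough to write out directly. The following is a minimal sketch, assuming hypothetical data with one deliberate outlier (the Table 8.2 values are not reproduced here); it uses the plain version of the MAD, with a median in both places and no constant multiplier.

```python
import math
import statistics


def variance(xs):
    """Sample variance with the n - 1 divisor."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sd(xs):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(variance(xs))


def mad(xs):
    """Median of the absolute deviations from the data median."""
    med = statistics.median(xs)
    return statistics.median([abs(x - med) for x in xs])


data = [1, 2, 3, 4, 100]  # hypothetical values; 100 is a deliberate outlier
print(variance(data), sd(data), mad(data))
```

The single outlier dominates the variance and standard deviation through the squared deviation, while the MAD stays at the scale of the other four points.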

8.6 Correlation and Covariance

A key question in evaluating data is to what extent two variables vary together. We expect income and education to vary in the same direction: Higher levels of one are associated with higher levels of the other. That is, if we look at a particular case with a high level of education, we expect to see a high income. Note the use of the word “expect” here, meaning that we are allowing for cases to occur in opposition to our notion without necessarily totally disregarding the general idea. Of course, if a great many cases did not reflect the theory, we would be inclined to dispense with it.

Covariance is a measure of variance with two paired variables. Positive values mean that there is positive varying effect between the two, and negative values mean that there is negative varying effect: High levels of one variable are associated with low levels of another. For two variables of the same length, \(x_1, x_2, \ldots, x_n\) and \(y_1, y_2, \ldots, y_n\), the covariance is given by

\[ \mathrm{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}). \]

This is very useful because it gives us a way of determining if one variable tends to vary positively or negatively with another. For instance, we would expect income and education levels to vary positively together, and income and prison time to vary negatively together. Furthermore, if there is no relationship between two variables, then it seems reasonable to expect a covariance near zero. But there is one problem with the covariance: We do not have a particular scale for saying what is large and what is small for a given dataset. What happens if we calculate the covariance of some variable with itself? Let’s take Cov(X, Y) and substitute in Y = X:

\[
\begin{aligned}
\mathrm{Cov}(X, X) &= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x}) \\
&= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
&= \mathrm{Var}(X).
\end{aligned}
\]

This means that the covariance is a direct generalization of the variance where we have two variables instead of one to consider. Therefore, while we do not generally know the context of the magnitude of the covariance, we can compare it to the magnitude of the variance for X as a reference. So one solution to the covariance scale problem is to measure the covariance in terms of units of the variance of X:

\[ \mathrm{Cov}^{*}(X, Y) = \frac{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}. \]

In this way units of the covariance are expressed in units of the variance that we can readily interpret. This seems unfair to the Y variable as there may not be anything special about X that deserves this treatment. Now instead let us measure the covariance in units of the standard deviation of X and Y:

\[ \mathrm{Cov}^{**}(X, Y) = \frac{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2}}. \]

The reason we use the standard deviation of X and Y in the denominator is that it scales this statistic conveniently to be bounded by [−1 : 1]. That is, if we re-performed our trick of substituting Y = X (or equivalently X = Y), then the statistic would be equal to one. In substantive terms, a value of one means that Y covaries exactly as X covaries. On the other hand, if we substituted Y = −X (or conversely X = −Y), then the statistic would be equal to negative one, meaning that Y covaries in exactly the opposite manner as X. Since these are the limits of the ratio, any value in between represents lesser degrees of absolute scaled covariance. This statistic is important enough to have a name: It is the correlation coefficient between X and Y (more formally, it is Pearson’s Product Moment Correlation Coefficient), and is usually denoted cor(X, Y) or \(r_{XY}\).

Example 8.12: An Ethnoarchaeological Study in the South American Tropical Lowlands. Siegel (1990) looked at the relationship between the size of buildings and the number of occupants in a South Amerindian tropical forest community located in the upper Essequibo region of Guyana. In such communities the household is the key structural focus in social, economic, and behavioral terms. The substantive point is that the overall settlement area of the community is a poor indicator of ethnographic context in terms of explaining observed relationships, but other ethnoarchaeological measures are quite useful in this regard. Siegel points out that ethnographic research tends not to provide accurate and useful quantitative data on settlement and building dimensions. Furthermore, understanding present-day spatial relationships in such societies has the potential to add to our understanding in historical archaeological studies. The main tool employed by Siegel was a correlational analysis between floor area of structures and occupational usage. There are four types of structures: residences, multipurpose work structures, storage areas, and community buildings. In these tribal societies it is common for extended family units to share household space including kitchen and storage areas but to reserve a component of this space for the nuclear family. Thus there is a distinction between households that are encompassing structures, and the individual residences within. Table 8.4 gives correlation coefficients between the size of the floor area for three definitions of space and the family unit for the village of Shefariymo where Total is the sum of Multipurpose space, Residence

What we see from this analysis is that there exists a positive but weak relationship between the size of the nuclear family and the size of the multipurpose space (0.137). Conversely, there are relatively strong relationships between the size of these same nuclear families and the size of their residences (0.662) and the total family space (0.714). A similar pattern emerges when looking at the size of the extended families occupying these structures except that the relationships are now noticeably stronger. Not surprisingly, the size of the extended family is almost perfectly correlated with residence size and total size.
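The step from covariance to correlation described in this section can be checked numerically. The sketch below is an illustration under stated assumptions: the paired values are hypothetical (they are not Siegel's data), and the function names cov and cor are ours.

```python
import math


def cov(xs, ys):
    """Sample covariance of two equal-length sequences, with the n - 1 divisor."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)


def cor(xs, ys):
    """Covariance scaled by the two standard deviations (Pearson's correlation)."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))


x = [1.0, 2.0, 4.0, 7.0]  # hypothetical paired data
y = [2.0, 1.0, 5.0, 6.0]

print(cov(x, x))                         # covariance of X with itself is Var(X)
print(cor(x, x), cor(x, [-v for v in x]))  # the two limits of the ratio
print(cov(x, y), cov(x, [10 * v for v in y]))  # covariance depends on the units of Y
print(cor(x, y), cor(x, [10 * v for v in y]))  # correlation does not
```

Rescaling Y by a factor of ten multiplies the covariance by ten but leaves the correlation unchanged, which is the scale problem the standardization is meant to solve.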

8.7 Expected Value

Expected value is essentially a probability-weighted average over possible events. Thus, with a fair coin, there are 5 expected heads in 10 flips. This does not mean that 5 heads will necessarily occur, but that we would be inclined to bet on 5 rather than any other number. Interestingly, with real-life interval measured data you never get the expected value in a given experiment because the probability of any one point on the real number line is zero. The discrete form of the expected value of some random variable X is

\[ E[X] = \sum_{i=1}^{k} X_i\, p(X_i), \]

for k possible events in the discrete sample space \(\{X_1, X_2, \ldots, X_k\}\). The continuous form is exactly the same except that instead of summing we need to integrate:

\[ E[X] = \int_{-\infty}^{\infty} X p(X)\,dX, \]

where the integral limits are adjusted if the range is restricted for this random variable, and these are often left off the integral form if these bounds are obvious. Intuitively, this is easier to understand initially for the discrete case. Suppose someone offered you a game that consisted of rolling a single die with the following payoffs:

die face         1    2    3    4    5    6
X, in dollars    0    1    1    1    2    2

Would you be inclined to play this game if it costs $2? The expected value of a play is calculated as

\[ E[X] = \sum_{i=1}^{6} X_i\, p(X_i) = \frac{1}{6}(0) + \frac{1}{6}(1) + \frac{1}{6}(1) + \frac{1}{6}(1) + \frac{1}{6}(2) + \frac{1}{6}(2) = \frac{7}{6}, \]

or about $1.17 (rounded). Therefore it would not make sense to pay $2 to play this game! This is exactly how all casinos around the world function: They calculate the expected value of a play for each of the games and make sure that it favors them. This is not to say that any one person cannot beat the casino, but on average the casino always comes out ahead.

So far we have just looked at the expected value of X, but it is a simple matter to evaluate the expected value of some function of X. The process inserts the function of X into the calculation above instead of X itself. Discrete and continuous forms are given by

\[ E[f(X)] = \sum_{i=1}^{k} f(X_i)\, p(X_i), \qquad E[f(X)] = \int_{-\infty}^{\infty} f(X) p(X)\,dX. \]
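The die game above is small enough to evaluate exactly. The sketch below uses exact fractions so that no rounding enters; the payoffs and the $2 price are taken from the text, while the function and variable names are ours. The same function handles the expected value of a function of X by passing that function in.

```python
from fractions import Fraction

# Payoff in dollars for each die face, as in the payoff table of the die game.
payoff = {1: 0, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2}
p_face = Fraction(1, 6)  # each face of a fair die is equally likely


def expected_value(f=lambda v: v):
    """E[f(X)]: sum of f(payoff) times probability over the six faces."""
    return sum(f(v) * p_face for v in payoff.values())


ev = expected_value()                      # E[X]
ev_sq = expected_value(lambda v: v ** 2)   # E[X^2], a function of X
net = ev - 2                               # expected gain after paying $2 to play

print(ev, float(ev))
print(ev_sq)
print(net, float(net))
```

The six-term sum works out to 7/6, so a player paying $2 per play loses 5/6 of a dollar on average.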

The calculation of expected value for vectors and matrices is only a little bit more complicated because we have to keep track of the dimension. A k × 1 vector X of discrete random variables has the expected value \(E[\mathbf{X}] = \sum \mathbf{X} p(\mathbf{X})\). For the expected value of a function of the continuous random vector it is common to use the Riemann-Stieltjes integral form:

\[ E[f(\mathbf{X})] = \int f(\mathbf{X})\,dF(\mathbf{X}), \]

where F(X) denotes the joint distribution of the random variable vector X. In much statistical work expected value calculations are “conditional” in the sense that the average for the variable of interest is taken conditional on another. For instance, the discrete form for the expected value of Y given a specific level of X is

\[ E[Y|X] = \sum_{i=1}^{k} Y_i\, p(Y_i|X). \]

Sometimes expectations are given subscripts when there is more than one random variable in the expression and it is not obvious to which one the expectation applies:

\[ \mathrm{Var}_x[E_y[Y|X]] = \mathrm{Var}_x\left[ \sum_{i=1}^{k} Y_i\, p(Y_i|X) \right]. \]
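A small worked case may help fix the subscript notation. The sketch below assumes a hypothetical joint PMF for (X, Y), invented purely for illustration, and computes the conditional expectation of Y at each level of X and then the variance, taken over X, of those conditional expectations. All names are ours.

```python
from fractions import Fraction as F

# Hypothetical joint PMF p(X = x, Y = y); the six probabilities sum to one.
joint = {
    (0, 1): F(2, 10), (0, 2): F(2, 10), (0, 3): F(1, 10),
    (1, 1): F(1, 10), (1, 2): F(1, 10), (1, 3): F(3, 10),
}


def p_x(x):
    """Marginal probability p(X = x)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def e_y_given_x(x):
    """E[Y | X = x]: sum over y of y * p(y | x), with p(y | x) = p(x, y) / p(x)."""
    return sum(y * p / p_x(x) for (xi, y), p in joint.items() if xi == x)


x_levels = sorted({x for x, _ in joint})

# Averaging the conditional means over X recovers the unconditional mean E[Y].
e_y = sum(e_y_given_x(x) * p_x(x) for x in x_levels)

# Var_x[E_y[Y | X]]: the variance, over the distribution of X, of the conditional means.
var_of_cond_mean = sum((e_y_given_x(x) - e_y) ** 2 * p_x(x) for x in x_levels)

print([e_y_given_x(x) for x in x_levels], e_y, var_of_cond_mean)
```

The inner expectation is over Y with X held fixed, and the outer variance is over X, which is exactly the distinction the subscripts are meant to record.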

8.8 Some Handy Properties and Rules

Since expectation is a summed quantity, many of these rules are obvious, but some are worth thinking about. Let X, Y, and Z be random variables defined in \(\mathbb{R}\) (the real number line), whose expectations are finite.

and winning probabilities at the game craps as a way to illustrate expected value. The key principle guiding casino management is that every game has negative expected value to the customer. However, craps has many bets that are very nearly “fair” in that the probability of winning is just below 0.5. This tends to attract the more “sophisticated” gamblers, but of course craps still makes money for the house. The basic process of a craps game is that one person (the shooter) rolls two dice and people bet on the outcome. The most common bet is a “pass,” meaning that the player has an unconditional win if the result is 7 or 11 and an unconditional loss if the result is 2, 3, or 12 (called craps). If the result, however, is 4, 5, 6, 8, 9, or 10, then the outcome is called a “point” and the shooter repeats until either the outcome is repeated (a win) or a 7 appears (a loss). The probabilities associated with each of the 11 possible sums on any given roll are

= 0.270707. This means that the probability of winning including the pass is 0.270707 + 0.2222222 = 0.4929292. So the expected value of a $5 bet is the expected winnings minus the cost to play: 10 × 0.4929292 + 0 × 0.5070708 − 5 = −0.070708, meaning about negative seven cents. A player can also play “don’t pass bar 12,” which is the opposite bet except that 12 is considered a tie (the gambler’s bet is returned). The probability of winning on this bet is

\[ 1 - p(\text{pass}) - \frac{1}{2} p(12) = 1 - 0.4929292 - \frac{1}{2}\left(\frac{1}{36}\right) = 0.4931818, \]

which has for a $5 bet the expected value 10 × 0.4931818 − 5 = −0.068182, which is slightly better than a pass. Two variants are the “come” bet where the player starts a pass bet in the middle of the shooter’s sequence, and the “don’t come” bet where the player starts a don’t pass bar 12 in the middle of the shooter’s sequence. Predictably these odds are identical to the pass and don’t pass bar 12 bets, respectively. Another common bet is the “field,” which bets that a 2, 3, 4, 9, 10, 11, or 12 occurs, with the probability of winning:

\[ p(2) + p(3) + p(4) + p(9) + p(10) + p(11) + p(12) = 0.4444444, \]

but a 2 or 12 pays double, thereby increasing the total expected value:

\[ 1.5p(2) + 1p(3) + 1p(4) + 1p(9) + 1p(10) + 1p(11) + 1.5p(12) = 0.4722222. \]

The probability of winning with a “big six” or “big eight” bet (6 or 8 comes up before 7) is \(\frac{5/36}{(5+6)/36} = 0.4545454\). It is also possible to bet on a specific value for the next roll, and the house sets differing payoffs according to

Note that the payouts used above may differ by casino/area/country/etc. Another bet in this category is “any craps,” which means betting on the occurrence of 2, 3, or 12. The payoff is 7/1, so

\[ p(2, 3, 12) = \left(\frac{4}{36}\right)\left(\frac{7+1}{2}\right) = 0.4444444, \]

and the expected value of a $5 bet is 10 × 0.4444444 − 5 = −0.55556.

The really interesting bet is called “odds,” which is sometimes billed falsely as giving even money to the player. This bet is allowed only during an ongoing pass, don’t pass, come, or don’t come bet, and the actual bet takes place during a point. Sometimes only a bet equal to or smaller than the original bet is allowed, but some casinos will let you double here. The actual bet is that the point value re-occurs before a 7, and one can also bet the “contrary” bet that it won’t (evidence of fairness for the second component of the bet). What does this mean in terms of probabilities and payoffs? The “old” game continues with the same probability of winning that benefits the house (0.4929292), but now a new game begins with new odds and the same rules as the point part of a pass play. A new betting structure starts with the payoffs

4 or 10 before 7 : 2−1,
5 or 9 before 7 : 1.5−1,
6 or 8 before 7 : 1.2−1.

The probabilities of each point value occurring before 7 (recall that you have one of these) are

\[
\begin{aligned}
p(4 \text{ before } 7) &= \frac{3}{3+6} = \frac{1}{3}, & p(10 \text{ before } 7) &= \frac{3}{3+6} = \frac{1}{3}, \\
p(5 \text{ before } 7) &= \frac{4}{4+6} = \frac{2}{5}, & p(9 \text{ before } 7) &= \frac{4}{4+6} = \frac{2}{5}, \\
p(6 \text{ before } 7) &= \frac{5}{5+6} = \frac{5}{11}, & p(8 \text{ before } 7) &= \frac{5}{5+6} = \frac{5}{11}.
\end{aligned}
\]

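So what are the odds on this new game (betting $1)? They are

\[
\begin{aligned}
4, 10 &: \quad \frac{1}{3}\left(\frac{2+1}{2}\right) = \frac{1}{2}, \\
5, 9 &: \quad \frac{2}{5}\left(\frac{1.5+1}{2}\right) = \frac{1}{2}, \\
6, 8 &: \quad \frac{5}{11}\left(\frac{1.2+1}{2}\right) = \frac{1}{2}.
\end{aligned}
\]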

The house still comes out ahead because you cannot play the even-money game independently, so the total probability is the average of 0.4929292 and 0.5, which is still below 0.5 (weighted by the relative bets). Also, the second half of pass bet when you are on the points has the probabilities p(4, 5, 6, 8, 9, 10 before 7) = 0.40606 . . . and p(7 first) = 0.59393 . . .. But apparently most craps players aren’t sophisticated enough to use this strategy anyway.
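The craps probabilities in this example can be recomputed exactly from the 36 equally likely outcomes of two fair dice. The sketch below is an illustration rather than a full model of the game: it follows the text's convention that a $5 bet returns $10 on a win, uses the odds payoffs quoted above, and all variable names are ours. Exact fractions are used, so the last printed digit can differ slightly from the text's decimals.

```python
from fractions import Fraction as F

# Count the ways to roll each sum with two fair dice.
ways = {s: 0 for s in range(2, 13)}
for a in range(1, 7):
    for b in range(1, 7):
        ways[a + b] += 1
p = {s: F(w, 36) for s, w in ways.items()}

POINTS = (4, 5, 6, 8, 9, 10)


def before_seven(point):
    """Probability that `point` is rolled before a 7, ignoring all other sums."""
    return p[point] / (p[point] + p[7])


natural = p[7] + p[11]                                        # immediate win
via_point = sum(p[pt] * before_seven(pt) for pt in POINTS)    # win by making the point
p_pass = natural + via_point
ev_pass = 10 * p_pass - 5                                     # $5 bet, $10 back on a win

p_dont_pass = 1 - p_pass - F(1, 2) * p[12]                    # the text's formula
field = sum(p[s] for s in (2, 3, 4, 9, 10, 11, 12))

# "Odds" bet payoffs from the text: 2-1, 1.5-1, and 1.2-1.
odds_payoff = {4: F(2), 10: F(2), 5: F(3, 2), 9: F(3, 2), 6: F(6, 5), 8: F(6, 5)}
odds_value = {pt: before_seven(pt) * (odds_payoff[pt] + 1) / 2 for pt in POINTS}

print(float(via_point), float(p_pass), float(ev_pass))
print(float(p_dont_pass), float(field))
print(odds_value)
```

Every entry of odds_value comes out to exactly one half, which is the sense in which the odds bet is even money, while the pass bet that must accompany it keeps the overall expectation negative.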

8.9 Inequalities Based on Expected Values

There are a number of “famous” inequalities related to expected values that are sometimes very useful. In all cases X and Y are random variables with expected values that are assumed to exist and are finite. The positive constants k and \(\ell\) are also assumed finite. These assumptions are actually important and the compelling book by Romano and Siegel, Counterexamples in Probability and Statistics (1986), gives cases where things can go awry otherwise. A classic reference on just inequalities is Inequalities by Hardy, Littlewood, and Polya (1988).

• Chebychev’s Inequality. If f(X) is a positive and nondecreasing function on [0, ∞], then for all (positive) values of k

\[ p(f(X) > k) \le E[f(X)/k]. \]

A more common and useful form of Chebychev’s inequality involves µ and σ, the mean and standard deviation of X (see Section 8.5). For k greater than or equal to 1:

\[ p(|X - \mu| \ge k\sigma) \le 1/k^2. \]

To relate these two forms, recall that µ is the expected value of X.

• Markov Inequality. Similar to Chebychev’s Inequality:

\[ P[|X| \ge k] \le E[|X|^{\ell}]/k^{\ell}. \]

• Jensen’s Inequality. If f(X) is a concave function (open toward the x-axis, like the natural log function), then

\[ E[f(X)] \le f(E[X]). \]

Note also that these inequalities apply for conditional expectations as well. For instance, the statement of Liapounov’s Inequality conditional on Y is

\[ \left(E[|X|^{k} \mid Y]\right)^{1/k} \le \left(E[|X|^{\ell} \mid Y]\right)^{1/\ell}. \]
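The Chebychev, Markov, and Jensen inequalities can be checked against the empirical distribution of a sample. The sketch below is an illustration only: the sample is simulated (shifted exponential draws, an arbitrary choice that keeps every value positive so the log is defined), the constants k and c are arbitrary, the Markov check uses the first-power case, and all names are ours.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable
# Hypothetical positive data: exponential draws shifted slightly away from zero.
sample = [random.expovariate(1.0) + 0.01 for _ in range(10_000)]

mu = statistics.fmean(sample)
sigma = statistics.pstdev(sample)  # divisor n, treating the sample as the whole distribution


def share(condition):
    """Fraction of sample values satisfying `condition` (an empirical probability)."""
    return sum(1 for x in sample if condition(x)) / len(sample)


k = 2.0
chebychev_lhs = share(lambda x: abs(x - mu) >= k * sigma)
chebychev_rhs = 1 / k ** 2

c = 3.0
markov_lhs = share(lambda x: abs(x) >= c)
markov_rhs = statistics.fmean([abs(x) for x in sample]) / c

# Jensen with the concave natural log: E[log X] <= log E[X].
jensen_lhs = statistics.fmean([math.log(x) for x in sample])
jensen_rhs = math.log(mu)

print(chebychev_lhs, "<=", chebychev_rhs)
print(markov_lhs, "<=", markov_rhs)
print(jensen_lhs, "<=", jensen_rhs)
```

In each case the bound holds with room to spare, which is typical: these inequalities trade sharpness for generality, since they assume almost nothing about the distribution.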

8.10 Moments of a Distribution

Most (but not all) distributions have a series of moments that define important characteristics of the distribution. In fact, we have already seen the first moment, which is the mean or expected value of the distribution. The general formula for the kth moment is based on the expected value:

\[ m_k = E[X^k] = \int_X x^k\,dF(x) \]

for the random variable X with distribution f(X) where the integration takes place over the appropriate support of X. It can also be expressed as

\[ m_k = \int_X e^{kx}\,dF(x), \]

which is more useful in some circumstances. The kth central moment is (often called just the “kth moment”)

\[ m_k = E[(X - m_1)^k] = \int_X (x - m_1)^k\,dF(x). \]

So the central moment is defined by a deviation from the mean. The most obvious and important central moment is the variance: \(\sigma^2 = E[(X - \bar{X})^2]\). We can use this second central moment to calculate the variance of a PDF or PMF. This calculation for the exponential is

\[
\begin{aligned}
\mathrm{Var}[X] &= E[(X - E[X])^2] \\
&= \int_0^{\infty} (x - E[X])^2 f(x|\beta)\,dx \\
&= \int_0^{\infty} (x - \beta)^2 \frac{1}{\beta} \exp[-x/\beta]\,dx \\
&= \int_0^{\infty} (x^2 - 2x\beta + \beta^2) \frac{1}{\beta} \exp[-x/\beta]\,dx \\
&= \int_0^{\infty} x^2 \frac{1}{\beta} \exp[-x/\beta]\,dx - \int_0^{\infty} 2x \exp[-x/\beta]\,dx + \int_0^{\infty} \beta \exp[-x/\beta]\,dx \\
&= (2\beta^2) - (2\beta^2) + (\beta^2) = \beta^2,
\end{aligned}
\]

where we use integration by parts and L’Hospital’s Rule to do the individual integrations. An important theory says that a distribution function is “determined” by its moments of all orders (i.e., all of them), and some distributions have an infinite number of moments defined. The normal distribution actually has an infinite number of moments. Conversely, the Cauchy PDF has no finite moments at all, even though it is “bell shaped” and looks like the normal (another reason not to use that expression). The Cauchy distribution has the PDF

While the first term above is finite because \(\arctan(\pm\infty) = \pm\frac{1}{2}\pi\), the second term is clearly infinite. It is straightforward to show that higher moments are also infinite and therefore undefined (an exercise), and clearly every central moment is also infinite as they all include \(m_1\).
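The exponential variance calculation in this section can be checked numerically. The sketch below is a rough check under stated assumptions: it uses a simple trapezoid rule, truncates the infinite upper limit at 60β (where the remaining tail is negligible), and picks β = 2 arbitrarily. The function names are ours.

```python
import math


def exp_pdf(x, beta):
    """Exponential PDF in the scale parameterization: (1/beta) * exp(-x/beta)."""
    return math.exp(-x / beta) / beta


def integrate(g, lo, hi, steps=100_000):
    """Composite trapezoid rule for the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (g(lo) + g(hi))
    for i in range(1, steps):
        total += g(lo + i * h)
    return total * h


beta = 2.0          # arbitrary illustrative value
upper = 60 * beta   # truncation point standing in for infinity (assumption)

m1 = integrate(lambda x: x * exp_pdf(x, beta), 0.0, upper)               # first moment
var = integrate(lambda x: (x - m1) ** 2 * exp_pdf(x, beta), 0.0, upper)  # second central moment

# The three pieces of the expansion (x^2 - 2*x*beta + beta^2) * (1/beta) * exp(-x/beta).
t1 = integrate(lambda x: x ** 2 * exp_pdf(x, beta), 0.0, upper)
t2 = integrate(lambda x: 2 * x * math.exp(-x / beta), 0.0, upper)
t3 = integrate(lambda x: beta * math.exp(-x / beta), 0.0, upper)

print(m1, var)
print(t1, t2, t3, t1 - t2 + t3)
```

Numerically the three pieces are close to 2β², 2β², and β², and combining them with the signs from the expansion gives β², matching the direct computation of the second central moment.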

The following data are exam grade percentages: 37, 39, 28, 73, 50, 59, 41, 57, 46, 41, 62, 28, 26, 66, 53, 54, 37, 46, 25. (a) What is the level of measurement for these data? (b) Suppose we change the data to create a new dataset in the following way: values from 25 to 45 are assigned “Low,” values from 45 to 60 are assigned “Medium,” and values from 60 to 75 are assigned “High.” Now what is the level of measurement for these data? (c) Now suppose we take the construction from (b) and assign “Low” and “High” to “Atypical” and assign “Medium” to “Typical.” What is the level of measurement of this new dataset? (d) Calculate the mean and standard deviation of each of the three datasets created above.

8.3 Morrison (1977) gave the following data for Supreme Court vacancies from 1837 to 1932:

Number of Vacancies/Year      0     1     2     3     4+
Number of Years for Event    59    27     9     1     0

Fit a distribution to these data, estimating any necessary parameters. Using this model, construct a table of expected versus observed frequencies by year.

8.4 The National Taxpayers Union Foundation (NTUF), an interest group that advocates reduced government spending, scores House members on the budgetary impact of their roll call votes. A “spending” vote is one in favor of a bill or amendment that increases federal outlays and a “saving” vote is one that specifically decreases federal spending (i.e., program cuts). The fiscal impact of each House member’s vote is cross-indexed and calculated as the total increase to the budget or the total decrease to the budget. The NTUF supplies these values along with a ranking of each member’s “fiscal responsibility,” calculated by adding all positive and negative fiscal costs of each bill voted on by each member and then ranking members by total cost. What is the level of measurement of the NTUF fiscal responsibility scale? Since House members’ values are compared in NTUF public statements, is there a different level of measurement being implied?

8.5 Suppose you had a Poisson process with intensity parameter λ = 5. What is the probability of getting exactly 7 events? What is the probability of getting exactly 3 events? These values are the same distance from the expected value of the Poisson distribution, so why are they different?

8.6 Given the following PMF:

\[ f(x) = \begin{cases} \dfrac{3!}{x!(3-x)!}\left(\dfrac{1}{2}\right)^{3} & x = 0, 1, 2, 3 \\ 0 & \text{otherwise,} \end{cases} \]

(a) prove that this is in fact a PMF; (b) find the expected value; (c) find the variance; (d) derive the CDF.

8.7 Let X be the event that a single die is rolled and the resulting number is even. Let Y be the event describing the actual number that results from the roll (1–6). Prove the independence or nonindependence of these two events.

\(X_1 + X_2\) and \(Y_2 = |X_1 - X_2|\) correlated? Are they independent?

8.9 Suppose we have a PMF with the following characteristics: \(p(X = -2) = \frac{1}{5}\), \(p(X = -1) = \frac{1}{6}\), \(p(X = 0) = \frac{1}{5}\), \(p(X = 1) = \frac{1}{15}\), and \(p(X = 2) = \frac{11}{30}\). Define the random variable \(Y = X^2\). Derive the PMF of Y and prove that it is a PMF. Calculate the expected value and variance of Y.

8.10 Charles Manski (1989) worried about missing data when the outcome variable of some study had missing values rather than when the variables assumed to be causing the outcome variable had missing values, which is the more standard concern. Missing values can cause serious problems in making probabilistic statements from observed data. His first concern was notated this way: “Suppose each member of a population is characterized by a triple (y, x, z) where y is a real number, z is a binary indicator, and x is a real number vector.” The problem is that, in collecting these data, (z, x) are always observed, but y is observed only when z = 1. The quantity of interest is E(y|x). Use the Theorem of Total Probability to express this conditional probability when the data only provide E(y|x, z = 1) and we cannot assume mean independence: E(y|x) = E(y|x, z = 1) = E(y|x, z = 0).

8.11 Twenty developing countries each have a probability of military coup of 0.01 in any given year. We study these countries over a 10-year period. (a) How many coups do you expect in total? (b) What is the probability of four coups?

(c) What is the probability that there will be no coups during this period?

8.12 Show that the full parameter normal PDF \(f(X|\mu, \sigma^2)\) reduces to the standard normal PDF when µ = 0 and \(\sigma^2 = 1\).

8.13 Use the exponential PDF to answer the following questions. (a) Prove that the exponential form is a PDF. (b) Derive the CDF. (c) Prove that the exponential distribution is a special case of the gamma distribution.

8.14 Use the normal PDF to answer the following questions. (a) If a normal distribution has µ = 25 and σ = 25, what is the 91st percentile of the distribution? (b) What is the 6th percentile of the distribution of part (a)? (c) The width of a line etched on an integrated circuit chip is normally distributed with mean 3.000µm and standard deviation 0.150. What width value separates the widest 10% of all such lines from the other 90%?

8.15 A function that can be used instead of the probit function is the logit function: \(\Lambda(X) = \frac{\exp(X)}{1+\exp(X)}\). Plot both the logit function and the probit function in the same graph and compare. What differences do you observe?

8.16 The beta function is defined for nonnegative values a and b as:

\[ B(a, b) = \int_0^1 x^{a-1} (1 - x)^{b-1}\,dx. \]

This form is used in some statistical problems and elsewhere. The relationship between the beta and gamma functions is given by

\[ B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}. \]

Prove this using the properties of PDFs.

8.17 Prove that E[Y|Y] = Y.

8.18 Suppose that the performance of test-takers is normally distributed around a mean, µ. If we observe that 99% of the students are within 0.194175 of the mean, what is the value of σ?

8.19 Calculate the entropy of the distribution B(n = 5, p = 0.1) and the distribution B(n = 3, p = 0.5). Which one is greater? Why?

8.20 We know that the reaction time of subjects to a specific visual stimulus is distributed gamma with α = 7 and β = 3, measured in seconds. (a) What is the probability that the reaction time is greater than 12 seconds? (b) What is the probability that the reaction time will be between 15 and 21 seconds? (c) What is the 95th percentile value of this distribution?

8.21 Show that the second moment of the Cauchy distribution is infinite and therefore undefined.

8.22 The following data are temperature measurements in Fahrenheit. Use these data to answer the following questions. 38.16 52.68 53.47 50.18 49.13

(a) Is the median bigger or smaller than the mean? (b) Calculate the mean and standard deviation. (c) What is the level of measurement for these data? (d) Suppose we transformed the data in the following way: Values from 0 to 40 are assigned “Cold,” values from 41 to 70 are assigned “Medium,” and values above 71 are assigned “Hot.” Now what is the level of measurement for these data? (e) Suppose we continue to transform the data in the following way: “Cold” and “Hot” are combined into “Uncomfortable,” and “Medium” is renamed “Comfortable.” What is the level of measurement for these data?

8.23 The following is a stem and leaf plot for 20 different observations (stem = tens digit). Use these data to answer the questions.

0 1 2 3 4 5 6 7 8

7 0 0 2

8 1 1 4

9 5 1

9 7 8

9 7 9

5 3 9

(a) Is the median bigger or smaller than the mean? (b) Calculate the 10% trimmed mean. (c) Make a frequency distribution with relative and relative cumulative frequencies. (d) Calculate the standard deviation. (e) Identify the IQR.

8.24 Nine students currently taking introductory statistics are randomly selected, and both the first midterm score (x) and the second midterm score (y) are determined. Three of the students have the class at 8 A.M., another three have it at noon, and the remaining three have a night class.

8 A.M.      Noon        Night
(70,60)     (80,72)     (45,63)
(72,83)     (60,74)     (50,40)
(94,85)     (55,58)     (35,54)

(b) Let \(\bar{x}_1\) = the average score on the first midterm for the 8 A.M. students and \(\bar{y}_1\) = the average score on the second midterm for these students. Let \(\bar{x}_2\) and \(\bar{y}_2\) be these averages for the noon students, and \(\bar{x}_3\) and \(\bar{y}_3\) be these averages for the evening students. Calculate r for these three \((\bar{x}, \bar{y})\) pairs. (c) Construct a scatterplot of the nine (x, y) pairs and construct another one of the three \((\bar{x}, \bar{y})\) pairs. Does this suggest that a correlation coefficient based on averages (an “ecological” correlation) might be misleading? Explain.

8.25 The Los Angeles Times (Oct. 30, 1983) reported that a typical customer of the 7-Eleven convenience stores spends $3.24. Suppose that the average amount spent by customers of 7-Eleven stores is the reported value of $3.24 and that the standard deviation for the amount of sale is $8.88.

(a) What is the level of measurement for these data? (b) Based on the given mean and standard deviation, do you think that the distribution of the variable amount of sale could have been symmetric in shape? Why or why not? (c) What can be said about the proportion of all customers that spend more than $20 on a purchase at 7-Eleven?

(a) Calculate the variance and the MAD of each of the three variables. (b) Calculate the correlation coefﬁcients. Truncate the variables such that there are no values to the right of the decimal point and recalculate the correlation coefﬁcients. Do you see a difference? Why or why not?

9
Markov Chains

9.1 Objectives

This chapter introduces an underappreciated topic in the social sciences that lies directly in the intersection of matrix algebra (Chapters 3 and 4) and probability (Chapter 7). It is an interesting area in its own right with many applications in the social sciences, but it is also a nice reinforcement of important principles we have already covered. Essentially the idea is relevant to the things we study in the social sciences because Markov chains specifically condition on the current status of events. Researchers find that this is a nice way to describe individual human decision making and collective human determinations. So Markov chains are very practical and useful. They model how social and physical phenomena move from one state to another.

The first part of this chapter introduces the mechanics of Markov chains through the kernel. This is the defining mechanism that “moves” the Markov chain around. The second part of the chapter describes various properties of Markov chains and how such chains can differ in how they behave. The first few properties are elementary, and the last few properties are noticeably more advanced and may be skipped by the reader who only wants an introduction.

9.2 Defining Stochastic Processes and Markov Chains

Markov chains sound like an exotic mathematical idea, but actually the principle is quite simple. Suppose that your decision-making process is based only on a current state of affairs. For example, in a casino wagering decisions are usually dictated only by the current state of the gambler: the immediate confronting decision (which number to bet on, whether to take another card, etc.) and the available amount of money. Thus these values at a previous point in time are irrelevant to future behavior (except perhaps in the psychological sense). Similarly, stock purchase decisions, military strategy, travel directions, and other such trajectories can often successfully be described with Markov chains.

What is a Markov chain? It is a special kind of stochastic process that has a “memoryless” property. That does not help much, so let us be specific. A stochastic process is a consecutive series of observed random variables. It is a random variable defined in exactly the standard way with a known or unknown distribution, except that the order of events is recordable. So for some state space, Θ, the random variable θ is defined by $\theta^{[t]} \sim F(\theta)$, $t \in T$, where t is some index value from the set T. Actually it is almost always simpler than this general definition of indexing since it is typical to define T as the nonnegative integers, so t = 0, 1, 2, 3, .... The implication of this simplification is that time periods are evenly spaced, and it is rare to suppose otherwise.

The state space that a stochastic process is defined on must be identified. This is exactly like the support of a probability mass function (PMF) or probability density function (PDF) in that it defines what values $\theta^{[t]}$ can take on at any point in time t. There are two types of state spaces: discrete and continuous. In general, discrete state spaces are a lot simpler to think about and we will therefore focus only on these here. Suppose that we had a cat locked in a classroom with square tiles on the floor.
If we defined the room as the state space (the cat cannot leave) and each square tile as a discrete state that the cat can occupy, then the path of the cat walking throughout the room is a stochastic process where we record the grid numbers of the tiles occupied by the cat over time. Now suppose that the walking decisions made by the cat are governed only by where it is at any given moment: The cat does not care where it has been in the past. To anyone who knows cats, this seems like a reasonable assumption about feline psychology. So the cat forgets previous states that it has occupied and wanders based only on considering where it is at the moment. This property means that the stochastic process in question is now a special case called a Markov chain.

More formally, a Markov chain is a stochastic process with the property that any specified state in the series, $\theta^{[t]}$, is dependent only on the previous state, $\theta^{[t-1]}$. But wait, yesterday's value ($\theta^{[t-1]}$) is then also conditional on the value from the day before that ($\theta^{[t-2]}$), and so on. So is it not then true that $\theta^{[t]}$ is conditional on every previous value: 0, 1, 2, ..., t − 1? Yes, in a sense, but conditioned on $\theta^{[t-1]}$, $\theta^{[t]}$ is independent of all previous values. This is a way of saying that all information that was used to determine the current step was contained in the previous step, and therefore if the previous step is considered, there is no additional information of importance in any of the steps before that one.

This “memoryless” property can be explicitly stated in the probability language from Chapter 7. We say that $\theta^{[t]}$ is conditionally independent of all values previous to $\theta^{[t-1]}$ if

$$p(\theta^{[t]} \in A \mid \theta^{[0]}, \theta^{[1]}, \ldots, \theta^{[t-2]}, \theta^{[t-1]}) = p(\theta^{[t]} \in A \mid \theta^{[t-1]}),$$

where A is any identified set (an event or range of events) on the complete state space (like our set of tiles in the classroom). Or, more colloquially, “a Markov chain wanders around the state space remembering only where it has been in the last period.” Now that does sound like a cat!

A different type of stochastic process that sometimes gets mentioned in the same texts is a martingale.
A martingale is defined using expectation instead of probability:

$$E(\theta^{[t]} \mid \theta^{[0]}, \theta^{[1]}, \ldots, \theta^{[t-2]}, \theta^{[t-1]}) = \theta^{[t-1]},$$

meaning that the expected value of the martingale in the next period is the value of the current position. Note that this differs from the Markov chain in that there is a stable iterative process based on this expectation rather than on Markovian probabilistic exploration. Also, since the future at time t + 1 and the past at time t − 1 are independent given the state at time t, the Markovian property does not care about the direction of time. This seems like a weird finding, but recall that time here is not a physical characteristic of the universe; instead it is a series of our own construction. Interestingly, there are Markov chains that are defined to work backward through “time,” such as those in coupling from the past (see Propp and Wilson 1996).

In general, interest is restricted here to discrete-time, homogeneous Markov chains. By discrete time, we simply mean that the counting process above, t = 0, 1, ..., T, is recordable at clear, distinguishable points. There is a corresponding study of continuous-time Markov processes, but it is substantially more abstract and we will not worry about it here. A homogeneous Markov chain is one in which the process of moving (i.e., the probability of moving) is independent of current time. Stated another way, move decisions at time t are independent of t.
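The memoryless property can be made concrete with a short simulation. The following is only a minimal sketch under stated assumptions: the three-tile "room," its transition probabilities, the function name `simulate_chain`, and the seed are all hypothetical choices made for illustration and do not come from the text. What matters is that each move is drawn using nothing but the row of probabilities for the tile currently occupied.

```python
import random

def simulate_chain(kernel, start, steps, rng):
    """Return the path of a discrete-time, homogeneous Markov chain.

    kernel[i][j] is the probability of moving from state i to state j.
    """
    path = [start]
    for _ in range(steps):
        current = path[-1]
        # Only the current state is consulted; earlier states are ignored.
        nxt = rng.choices(range(len(kernel)), weights=kernel[current])[0]
        path.append(nxt)
    return path

# Hypothetical room with three tiles in a row (an assumption for illustration):
# the cat stays put or steps to an adjacent tile.
kernel = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

rng = random.Random(42)  # arbitrary seed so that the run is repeatable
path = simulate_chain(kernel, start=0, steps=20, rng=rng)
print(path)
```

Because the same `kernel` is used at every step, the sketch is also homogeneous in the sense described above: the move probabilities do not depend on t.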

Example 9.1: Contraception Use in Barbados. Ebanks (1970) looked at contraception use by women of lower socio-economic class in Barbados and found a stable pattern in the 1950s and a different stable pattern that emerged in the late 1960s. This is of anthropological interest because contraception and reproduction are key components of family and social life in rural areas. His focus was on the stability and change of usage, looking at a sample from family planning programs at the time. Using 405 respondents from 1955 and another 405 respondents from 1967, he produced the following change probabilities, where the row indicates current state and the column indicates the next state (so, for instance, the probability of moving from “Use” at the current state to “Not Use” in the next state for 1955 is 0.52):

                    1955                   1967
              Use     Not Use        Use     Not Use
Use           0.48    0.52           0.89    0.11
Not Use       0.08    0.92           0.52    0.48

The first obvious pattern that emerges is that users in the 1950s were nearly equally likely to continue as to quit and nonusers were overwhelmingly likely to continue nonuse. However, the pattern is reversed in the late 1960s, whereby users were overwhelmingly likely to continue and nonusers were about equally likely to continue or switch to use. If we are willing to consider these observed (empirical) probabilities as enduring and stable indications of underlying behavior, then we can “run” a Markov chain to get anticipated future behavior. This is done very mechanically by treating the 2 × 2 tables here as matrices and multiplying repeatedly. What does this do? It produces expected cell values based on the probabilities in the last iteration: the Markovian property.

There are (at least) two interesting things we can do here. Suppose we were interested in predicting contraception usage for 1969, that is, two years into the future. This could be done simply by the following steps:

$$\begin{bmatrix} 0.89 & 0.11 \\ 0.52 & 0.48 \end{bmatrix} \times \begin{bmatrix} 0.89 & 0.11 \\ 0.52 & 0.48 \end{bmatrix} = \underbrace{\begin{bmatrix} 0.85 & 0.15 \\ 0.71 & 0.29 \end{bmatrix}}_{1968}$$

$$\begin{bmatrix} 0.85 & 0.15 \\ 0.71 & 0.29 \end{bmatrix} \times \begin{bmatrix} 0.89 & 0.11 \\ 0.52 & 0.48 \end{bmatrix} = \underbrace{\begin{bmatrix} 0.83 & 0.17 \\ 0.78 & 0.22 \end{bmatrix}}_{1969}.$$

This means that we would expect to see an increase in nonusers converting to users if the 1967 rate is an underlying trend. Secondly, we can test Ebanks' assertion that the 1950s were stable. Suppose we take the 1955 matrix of transitions and apply it iteratively to get a predicted distribution across the four cells for 1960. We can then compare it to the actual distribution seen in 1960, and if it is similar, then the claim is supportable (the match will not be exact, of course, due to sampling considerations). Multiplying the 1955 matrix four times gives

$$\begin{bmatrix} 0.48 & 0.52 \\ 0.08 & 0.92 \end{bmatrix}^{4} = \begin{bmatrix} 0.16 & 0.84 \\ 0.13 & 0.87 \end{bmatrix}.$$

This suggests the following empirical distribution, given the marginal numbers of users for 1959 in the study:

1959-1960 predicted     Use    Not Use
Use                       7       41
Not Use                  46      311

which can be compared with the actual 1960 numbers from that study:

1959-1960 actual        Use    Not Use
Use                      27       21
Not Use                  39      318

These are clearly dissimilar enough to suggest that the process is not Markovian as claimed. More accurately, it can perhaps be described as a martingale since the 1955 actual numbers are $\left[\begin{smallmatrix} 24 & 26 \\ 28 & 327 \end{smallmatrix}\right]$. What we are actually seeing here is a question of whether the 1955 proportions define a transition kernel for the Markov chain going forward. The idea of a transition kernel is explored more fully in the next section.
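The repeated multiplications in Example 9.1 are easy to check by machine. The sketch below uses plain Python lists so that it runs without extra libraries; the helper names `mat_mul`, `mat_pow`, and `rounded` are hypothetical and introduced only for this illustration, while the two kernels are the 1955 and 1967 tables given in the example.

```python
def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def mat_pow(m, n):
    """Raise a square matrix to a positive integer power by repeated multiplication."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result

def rounded(m, digits=2):
    """Round every cell, matching the two-decimal presentation in the example."""
    return [[round(x, digits) for x in row] for row in m]

K_1955 = [[0.48, 0.52], [0.08, 0.92]]
K_1967 = [[0.89, 0.11], [0.52, 0.48]]

print(rounded(mat_pow(K_1967, 2)))  # [[0.85, 0.15], [0.71, 0.29]]
print(rounded(mat_pow(K_1967, 3)))  # [[0.83, 0.17], [0.78, 0.22]]
print(rounded(mat_pow(K_1955, 4)))  # [[0.16, 0.84], [0.13, 0.87]]
```

One design note: the powers here are computed from the unrounded kernel and rounded only for display, so small differences from a hand calculation that rounds at every step are possible in other examples, although not in this one.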

9.2.1 The Markov Chain Kernel

We know that a Markov chain moves based only on its current position, but using that information, how does the Markov chain decide? Every Markov chain is defined by two things: its state space (already discussed) and its transition kernel, K(). The transition kernel is a general mechanism for describing the probability of moving to other states based on the current chain status. Specifically, K(θ, A) is a defined probability measure for all θ points in the state space to the set A ⊆ Θ: It maps potential transition events to their probability of occurrence.

The easiest case to understand is when the state space is discrete and K is just a matrix mapping: a k × k matrix for k discrete elements that exhaust the allowable state space, A. We will use the notation $\theta_i$, meaning the ith state of the space. So a Markov chain that occupies subspace i at time t is designated $\theta_i^{[t]}$. Each individual cell defines the probability of a state transition from the first term to all possible states:

$$K = \begin{bmatrix}
p(\theta_1, \theta_1) & p(\theta_1, \theta_2) & \cdots & p(\theta_1, \theta_{k-1}) & p(\theta_1, \theta_k) \\
p(\theta_2, \theta_1) & p(\theta_2, \theta_2) & \cdots & p(\theta_2, \theta_{k-1}) & p(\theta_2, \theta_k) \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
p(\theta_k, \theta_1) & p(\theta_k, \theta_2) & \cdots & p(\theta_k, \theta_{k-1}) & p(\theta_k, \theta_k)
\end{bmatrix}.$$

The first term in p(), constant across rows, indicates where the chain is at the current period and the column indicates potential destinations. Each matrix element is a well-behaved probability, $p(\theta_i, \theta_j) \geq 0$, $\forall i, j \in A$. The notation here can be a little bit confusing as it looks like a joint distribution in the sense of Chapter 7. This is an unfortunate artifact, and one just has to remember the different context. The rows of K sum to one and define a conditional PMF because they are all specified for the same starting value and cover each possible destination in the state space.

We can also use this kernel to calculate state probabilities for arbitrary future times. If we multiply the transition matrix (kernel) by itself j times, the result, $K^j$, gives the j-step transition matrix for this Markov chain. Each row is the set of transition probabilities from that row state to each of the other states in exactly j iterations of the chain. It does not say, however, what that sequence is in exact steps.
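It may help to see why multiplying the kernel by itself gives multi-step probabilities. The worked equation below is a sketch for the two-step case; the superscript notation $p^{(2)}$ is an assumption introduced here for illustration and is not used elsewhere in the chapter. The numerical instance uses the 1967 kernel from Example 9.1.

```latex
% Two-step transition probability: sum over every possible intermediate state \theta_m.
p^{(2)}(\theta_i, \theta_j) = \sum_{m=1}^{k} p(\theta_i, \theta_m)\, p(\theta_m, \theta_j)

% Worked instance with the 1967 kernel from Example 9.1 (Use to Use in two steps),
% passing through either Use or Not Use at the intermediate step:
p^{(2)}(\text{Use}, \text{Use}) = (0.89)(0.89) + (0.11)(0.52) = 0.7921 + 0.0572 = 0.8493 \approx 0.85
```

The sum is exactly the row-by-column rule of matrix multiplication, which is why the (i, j) cell of $K^2$ holds this quantity. It also shows why the j-step matrix does not reveal the exact path: the intermediate state has been summed out.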

Example 9.2: Campaign Contributions. It is no secret that individuals who have contributed to a Congress member's campaign in the past are more likely than others to contribute in the next campaign cycle. This is why politicians keep and value donor lists, even including those who have given only small amounts in the past. Suppose that 25% of those contributing in the past to a given member are likely to do so again and only 3% of those not giving in the past are likely to do so now. The resulting transition matrix is denoted as follows:

                      current period
                      θ1       θ2
last period    θ1     0.97     0.03
               θ2     0.75     0.25

where θ1 is the state for no contribution and θ2 denotes a contribution. Notice that the rows necessarily add to one because there are only two states possible in this space. This articulated kernel allows us to ask some useful questions on this candidate's behalf. If we start with a list of 100 names where 50 of them contributed last period and 50 did not, what number can we expect to have contribute from this list? In Markov chain language this is called a starting point or starting vector:

$$S_0 = \begin{bmatrix} 50 & 50 \end{bmatrix};$$

that is, before running the Markov chain, half of the group falls in each category. To get to the Markov chain first state, we simply multiply the initial state by the transition matrix:

$$\text{First state: } S_1 = \begin{bmatrix} 50 & 50 \end{bmatrix} \begin{bmatrix} 0.97 & 0.03 \\ 0.75 & 0.25 \end{bmatrix} = \begin{bmatrix} 86 & 14 \end{bmatrix}.$$

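So we would expect to get contributions from 14 off of this list. Since incumbent members of Congress enjoy a repeated electoral advantage for a number of reasons, let us assume that our member runs more consecutive races (and wins!). If we keep track of this particular list over time (maybe they are especially wealthy or influential constituents), what happens to our expected number of contributors? We can keep moving the Markov chain forward in time to find out:

$$\begin{aligned}
\text{Second state: } S_2 &= \begin{bmatrix} 86 & 14 \end{bmatrix} \begin{bmatrix} 0.97 & 0.03 \\ 0.75 & 0.25 \end{bmatrix} = \begin{bmatrix} 94 & 6 \end{bmatrix} \\
\text{Third state: } S_3 &= \begin{bmatrix} 94 & 6 \end{bmatrix} \begin{bmatrix} 0.97 & 0.03 \\ 0.75 & 0.25 \end{bmatrix} = \begin{bmatrix} 96 & 4 \end{bmatrix} \\
\text{Fourth state: } S_4 &= \begin{bmatrix} 96 & 4 \end{bmatrix} \begin{bmatrix} 0.97 & 0.03 \\ 0.75 & 0.25 \end{bmatrix} = \begin{bmatrix} 96 & 4 \end{bmatrix}.
\end{aligned}$$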

We rounded to integer values at each step since by definition donors can only give or not give. It turns out that no matter how many times we run this chain forward from this point, the returned state will always be [96, 4], indicating an overall 4% donation rate. This turns out to be a very important property of Markov chains and the subject of the next section. In fact, for this simple example we could solve directly for the steady state $S = [s_1, s_2]$ by stipulating

$$\begin{bmatrix} s_1 & s_2 \end{bmatrix} \begin{bmatrix} 0.97 & 0.03 \\ 0.75 & 0.25 \end{bmatrix} = \begin{bmatrix} s_1 & s_2 \end{bmatrix}$$

(using $s_1 + s_2 = 100$), which gives approximately [96.15, 3.85], where the difference from above is due to rounding.
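The bookkeeping in Example 9.2 can be reproduced in a few lines. This is a minimal sketch: the function name `step` and the variable names are hypothetical, the kernel and the starting vector [50, 50] are those of the example, and the rounding to whole donors after each move follows the convention stated in the text. The closing lines solve the steady-state condition by hand for the two-state case rather than calling a linear algebra library.

```python
def step(state, kernel):
    """Move the chain forward once: row vector times transition matrix."""
    return [sum(state[i] * kernel[i][j] for i in range(len(state)))
            for j in range(len(kernel[0]))]

K = [[0.97, 0.03],   # theta_1 (no contribution) -> theta_1, theta_2
     [0.75, 0.25]]   # theta_2 (contribution)    -> theta_1, theta_2

state = [50, 50]     # starting vector S0
history = []
for _ in range(4):
    # Round to whole donors after every move, as the example does.
    state = [round(x) for x in step(state, K)]
    history.append(state)
print(history)  # [[86, 14], [94, 6], [96, 4], [96, 4]]

# Steady state without rounding: the second column of S K = S reads
# s2 = 0.03*s1 + 0.25*s2, so 0.75*s2 = 0.03*s1; combine with s1 + s2 = 100.
s1 = 100 * 0.75 / (0.75 + 0.03)
s2 = 100 - s1
print(round(s1, 2), round(s2, 2))  # 96.15 3.85
```

The exact steady state is not a whole number of donors, which is why the rounded chain settles at [96, 4] rather than at the unrounded solution.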

9.2.2 The Stationary Distribution of a Markov Chain

Markov chains can have a stationary distribution: a distribution reached from iterating the chain until some point in the future where all movement probabilities are governed by a single probabilistic statement, regardless of time or position. This is equivalent to saying that when a Markov chain has reached its stationary distribution there is a single marginal distribution rather than the conditional distributions in the transition kernel.

To be specific, define π(θ) as the stationary distribution of the Markov chain for θ on the state space A. Recall that $p(\theta_i, \theta_j)$ is the probability that the chain will move from $\theta_i$ to $\theta_j$ at some arbitrary step t, and $\pi^t(\theta)$ is the corresponding marginal distribution. The stationary distribution satisfies

$$\sum_{\theta_i} \pi^{t}(\theta_i)\, p(\theta_i, \theta_j) = \pi^{t+1}(\theta_j).$$

This is very useful because when we want the Markov chain to reach and describe a given marginal distribution, it is only necessary to specify a transition kernel and let the chain run until the probability structures match the desired marginal.

Example 9.3: Shuffling Cards. We will see here if we can use a Markov chain algorithm to shuffle a deck of cards such that the marginal distribution is uniform: Each card is equally likely to be in any given position. So the objective (stationary distribution) is a uniformly random distribution in the deck: The probability of any one card occupying any one position is 1/52.

The suggested algorithm is to take the top card and insert it uniformly randomly at some other point in the deck, and continue. Is this actually a Markov chain? What is the stationary distribution and is it the uniform distribution as desired? Bayer and Diaconis (1992) evaluated a number of these shuffling algorithms from a technical perspective.

To answer these questions, we simplify the problem (without loss of generality) to consideration of a deck of only three cards numbered 1, 2, 3. The sample space for this setup is then given by

A = {[1, 2, 3], [1, 3, 2], [2, 1, 3], [2, 3, 1], [3, 1, 2], [3, 2, 1]},

which has 3! = 6 elements from the counting rules given in Chapter 7. A sample chain trajectory looks like

[1, 3, 2] → [3, 1, 2] → [1, 3, 2] → [3, 2, 1] → ...

Looking at the second step, we took the 3 card off the top of the deck and picked the second position from among three. Knowing that the current position is [3, 1, 2], the probabilities and potential outcomes are given by

Action                     Outcome      Probability
return to top of deck      [3, 1, 2]    1/3
put in middle position     [1, 3, 2]    1/3
put in bottom position     [1, 2, 3]    1/3

To establish the potential outcomes we only need to know the current position of the deck and the probability structure (the kernel) here. Being aware of the position of the deck at time t = 2 tells us everything we need to know about the deck, and having this information means that knowing that the position of the deck at time t = 1 was [1, 3, 2] is irrelevant to calculating the potential outcomes and their probabilities in the table above. So once the current position is established in the Markov chain, decisions about where to go are conditionally independent of the past.

It should also be clear so far that not every position of the deck is immediately reachable from every other position. For instance, we cannot move directly from [1, 3, 2] to [2, 1, 3] because it would require at least one additional step. The transition kernel assigns positive (uniform) probability from each state to each reachable state in one step and zero probability to all other states:

             [1,2,3]  [1,3,2]  [2,1,3]  [2,3,1]  [3,1,2]  [3,2,1]
[1,2,3]        1/3      0        1/3      1/3      0        0
[1,3,2]        0        1/3      0        0        1/3      1/3
[2,1,3]        1/3      1/3      1/3      0        0        0
[2,3,1]        0        0        0        1/3      1/3      1/3
[3,1,2]        1/3      1/3      0        0        1/3      0
[3,2,1]        0        0        1/3      1/3      0        1/3

Let us begin with a starting point at [1, 2, 3] and look at the marginal distribution after each application of the transition kernel. Mechanically, we do this by pre-multiplying the transition kernel matrix by [1, 0, 0, 0, 0, 0] as the starting probability (i.e., a deterministic decision to begin at the specified point). Then we record the result, multiply it by the kernel, and continue. The first 15 iterations produce the following marginal probability vectors:

Clearly there is a sense that the probability structure of the marginals has converged to a uniform pattern. This is as we expected, and it means that the shuffling algorithm will eventually produce the desired marginal probabilities. It is important to remember that these stationary probabilities are not the probabilities that govern movement at any particular point in the chain; that is still the kernel. These are the probabilities of seeing one of the six events at any arbitrary point in time unconditional on current placement.

This difference may be a little bit subtle. Recall that there are three unavailable outcomes and three equal probability outcomes for each position of the deck. So the marginal distribution above cannot function as a state-to-state transition mechanism. What is the best guess as to the unconditional state of the deck in 10,000 shuffles? It is equally likely that any of the six states would be observed. But notice that this question ignores the state of the deck in 9,999 shuffles. If we do not have this information, then the marginal distribution (if known) is the best way to describe outcome probabilities because it is the long-run probability of the states once the Markov chain is in its stationary distribution.
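The three-card version of Example 9.3 is small enough to build and iterate directly. The sketch below is illustrative only: the names `moves`, `kernel`, and `marginal` are hypothetical, and the kernel is constructed from the stated shuffling rule (top card reinserted at one of three positions with equal probability) rather than copied from a printed matrix. States are ordered as in the set A given in the example.

```python
from itertools import permutations

# The six orderings of a three-card deck; the first element is the top card.
states = [list(p) for p in permutations([1, 2, 3])]

def moves(deck):
    """All decks reachable in one step by reinserting the top card."""
    top, rest = deck[0], deck[1:]
    return [rest[:i] + [top] + rest[i:] for i in range(len(deck))]

# Build the 6 x 6 kernel: probability 1/3 for each reachable deck, 0 otherwise.
n = len(states)
kernel = [[0.0] * n for _ in range(n)]
for i, deck in enumerate(states):
    for nxt in moves(deck):
        kernel[i][states.index(nxt)] += 1 / len(deck)

# Start deterministically at [1, 2, 3] and apply the kernel 15 times.
marginal = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
for iteration in range(1, 16):
    marginal = [sum(marginal[i] * kernel[i][j] for i in range(n))
                for j in range(n)]
    print(iteration, [round(p, 4) for p in marginal])
```

Each row of `kernel` has exactly three entries of 1/3, and so does each column, which is one way to see why the uniform distribution is left unchanged by the kernel.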

9.3 Properties of Markov Chains

Markov chains have various properties that govern how they behave as they move around their state spaces. These properties are important because they determine whether or not the Markov chain is producing values that are useful to our more general purpose.

9.3.1 Homogeneity and Periodicity

A Markov chain is said to be homogeneous at some step t if the transition probabilities at this step do not depend on the value of t. This definition implies that Markov chains can be homogeneous for some periods and non-homogeneous for other periods. The homogeneity property is usually important in that Markov chains that behave according to some function of their age are usually poor theoretical tools for exploring probability statements of interest.

A related, and important, property is the period of a Markov chain. If a Markov chain operates on a deterministic repeating schedule of steps, then it is said to be a Markov chain of period-n, where n is the time (i.e., the number of steps) in the recurring period. It seems fairly obvious that a periodic Markov chain is not a homogeneous Markov chain because the period implies a dependency of the chain on the time t.

9.3.1.1 A Simple Illustration of Time Dependency

As an illustration of homogeneity and periodicity, consider a simple Markov chain operating on a discrete state space with only four states, θ: 1, 2, 3, 4, illustrated by

where s is the number of steps that the chain moves to the right. The state space wraps around from 4 back to 1 so that s = 4 means that the chain returns to the same place. Does this chain actually have the Markovian property? Certainly it does, because movement is dictated only by the current location and the stipulated kernel. Is it periodic or homogeneous? Since there is no repetition or dependency on t in the kernel it is clearly both aperiodic and homogeneous.

How could this Markov chain be made to be periodic? Suppose that the kernel above was replaced by the cycling rule:

[1, 2, 3, 4, 4, 3, 2, 1, 1, 2, 3, 4, 4, 3, 2, 1, 1, 2, 3, 4, 4, 3, 2, 1, ...],

then the period would be 8 and the chain would repeat this pattern forever. While the path of the chain depends on the time in the sense that the next deterministic step results from the current deterministic step, we usually just call this type of chain periodic rather than nonhomogeneous. This is because periodicity is more damaging to stochastic simulation with Markov chains than nonhomogeneity. To see the difference, consider the following kernel, which gives a nonhomogeneous Markov chain that is not periodic:

$$\theta^{[t+1]} = \text{runif}[(t-1){:}t] \bmod 4 + 1,$$

where the notation runif[(t − 1):t] here means a random uniform choice of integer between t − 1 and t. So this chain does not have a repeating period, but it is clearly dependent on the current value of t. For example, now run this chain for 10 iterations from a starting point at 1:

There is a built-in dependence on t in this chain where lower values are more likely just after t reaches a multiple of 4, and higher values are more likely just before t reaches a multiple of 4.

There is one last house-keeping detail left for this section. Markov chains are generally implemented with computers, and the underlying random numbers generated on computers have two characteristics that make them not truly random. For one thing, the generation process is fully discrete since the values are created from a finite binary process and normalized through division. This means that these are pseudo-random numbers and necessarily rational. Yet, truly random numbers on some defined interval are irrational with probability one because the irrationals dominate the continuous metric. In addition, while we call these values random or more accurately pseudorandom numbers, they are not random at all because the process that generates them is completely deterministic. The algorithms create a stream of values that is not random in the indeterminate sense but still resembles a random process. The deterministic streams varyingly lack systematic characteristics (Coveyou 1960): the time it takes to repeat the stream exactly (the period) and repeated patterns in lagged sets within the stream. So, by necessity, algorithmic implementations have to live with periodicity, and it is worth the time and energy in applied settings to use the available random number generator with the longest period possible.
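The two time-dependent rules described in this section can be sketched in code. The following is a hedged illustration, not the author's implementation: the function name `nonhomogeneous_step`, the seed, and the decision to run t from 1 to 10 are assumptions. The rule itself follows the kernel in the text, a uniform choice between the integers t − 1 and t, taken mod 4, plus 1, and the cycle is the period-8 schedule listed above.

```python
import random

def nonhomogeneous_step(t, rng):
    """Draw theta[t+1]: choose t-1 or t uniformly, wrap into the states 1..4."""
    return rng.randint(t - 1, t) % 4 + 1

rng = random.Random(0)   # arbitrary seed so that the run is repeatable
chain = [1]              # starting point at state 1
for t in range(1, 11):   # ten iterations; the indexing of t is an assumption
    chain.append(nonhomogeneous_step(t, rng))
print(chain)

# The deterministic cycling rule from the text: a chain of period 8.
cycle = [1, 2, 3, 4, 4, 3, 2, 1]
periodic = [cycle[t % len(cycle)] for t in range(24)]
print(periodic)
```

In the first rule the draw depends on t, which is exactly what makes the chain nonhomogeneous: when t is a multiple of 4 the only possible outcomes are 4 and 1, and one step later they are 1 and 2.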

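9.3.2 Irreducibility

A state A in the state space of a Markov chain is irreducible if for every two substates or individual events $\theta_i$ and $\theta_j$ in A, these two substates “communicate.” This means that the Markov chain is irreducible on A if every reached point or collection of points can be reached from every other reached point or collection of points. Consider, for example, the following kernel on four states, with rows and columns ordered $\theta_1, \theta_2, \theta_3, \theta_4$:

$$K = \begin{pmatrix}
\tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 \\
\tfrac{1}{3} & \tfrac{2}{3} & 0 & 0 \\
0 & 0 & \tfrac{3}{4} & \tfrac{1}{4} \\
0 & 0 & \tfrac{1}{4} & \tfrac{3}{4}
\end{pmatrix}.$$

The chain determined by this kernel is reducible because if it is started in either $\theta_1$ or $\theta_2$, then it operates as if it has the transition matrix (rows and columns ordered $\theta_1, \theta_2$):

$$K_{1,2} = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{3} & \tfrac{2}{3} \end{pmatrix},$$

and if it is started in either $\theta_3$ or $\theta_4$, then it operates as if it has the transition matrix (rows and columns ordered $\theta_3, \theta_4$):

$$K_{3,4} = \begin{pmatrix} \tfrac{3}{4} & \tfrac{1}{4} \\ \tfrac{1}{4} & \tfrac{3}{4} \end{pmatrix}.$$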

Thus the original Markov chain determined by K is reducible to one of two forms, depending on where it is started, because there are in each case two permanently unavailable states. To provide a contrast, consider the Markov chain determined by the following kernel, K′, which is very similar to K but is irreducible (rows and columns ordered $\theta_1, \theta_2, \theta_3, \theta_4$):

$$K' = \begin{pmatrix}
\tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 \\
\tfrac{1}{3} & 0 & \tfrac{2}{3} & 0 \\
0 & \tfrac{3}{4} & 0 & \tfrac{1}{4} \\
0 & 0 & \tfrac{1}{4} & \tfrac{3}{4}
\end{pmatrix}.$$

This occurs because there is now a two-way “path” between the previously separated upper and lower submatrices.

Related to this is the idea of hitting times. The hitting time of a state A and a Markov chain θ is the shortest time for the Markov chain to begin in A and return to A:

$$T_A = \inf\{n > 0 : \theta^{[n]} \in A\}.$$

Recall that the notation “inf” from Chapter 1 means the lowest (positive) value of n that satisfies $\theta^{[n]} \in A$. It is conventional to define $T_A$ as infinity if the Markov chain never returns to A. From this definition we get the following important result:

An irreducible and aperiodic Markov chain on state space A will for each subspace of A, $a_i$, have a finite hitting time for $a_i$ with probability one and a finite expected value of the hitting time:

$$p(T_{a_i} < \infty) = 1, \qquad E[T_{a_i}] < \infty.$$

By extension we can also say that the probability of transitioning between two substates of A, $a_i$ and $a_j$, in finite time is guaranteed to be nonzero.
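Whether every state "communicates" with every other state can be checked mechanically by following the nonzero entries of a kernel. The sketch below is one possible way to do this and rests on assumptions that should be read carefully: the function names `reachable` and `is_irreducible` are hypothetical, the reducible kernel `K` is assumed to be block diagonal with $K_{1,2}$ and $K_{3,4}$ on the diagonal, and `K_prime` is the irreducible kernel as read from this section, with a link between θ2 and θ3.

```python
from fractions import Fraction as F

def reachable(kernel, start):
    """States that can be reached from `start` in one or more steps."""
    seen, stack = set(), [start]
    while stack:
        i = stack.pop()
        for j, p in enumerate(kernel[i]):
            if p > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def is_irreducible(kernel):
    """True when every state can be reached from every state."""
    everything = set(range(len(kernel)))
    return all(reachable(kernel, i) == everything for i in everything)

# Assumed reducible kernel: two 2 x 2 blocks that never exchange probability.
K = [[F(1, 2), F(1, 2), 0, 0],
     [F(1, 3), F(2, 3), 0, 0],
     [0, 0, F(3, 4), F(1, 4)],
     [0, 0, F(1, 4), F(3, 4)]]

# Kernel with a two-way path between the blocks (theta_2 <-> theta_3).
K_prime = [[F(1, 2), F(1, 2), 0, 0],
           [F(1, 3), 0, F(2, 3), 0],
           [0, F(3, 4), 0, F(1, 4)],
           [0, 0, F(1, 4), F(3, 4)]]

print(is_irreducible(K), is_irreducible(K_prime))  # False True
print(sorted(reachable(K, 0)))                     # [0, 1]
```

Exact fractions are used only so that the rows visibly sum to one; the reachability check itself looks only at which entries are positive.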

9.3.3 Recurrence

Some characteristics belong to states rather than Markov chains. Of course Markov chains operating on these states are affected by such characteristics. A state A is said to be absorbing if once a Markov chain enters this state it cannot leave: $p(A, A^{c}) = 0$, where $A^{c}$ denotes the complement of A. Conversely, A is transient if the probability of the chain not returning to this state is nonzero: 1 − p(A, A) > 0. This is equivalent to saying that the chain will return to A for only a finite number of visits in infinite time. State A is said to be closed to another state B if a Markov chain on A cannot reach B: p(A, B) = 0. State A is clearly closed in general if it is absorbing since $B = A^{c}$ in this case (note that this is a different definition of “closed” used in a different context than that in Chapter 4).

These properties of states allow us to define an especially useful characteristic of both states and chains. If a state is closed, discrete, and irreducible, then this state and all subspaces within this state are called recurrent, and Markov chains operating on recurrent state spaces are recurrent. From this we can say something important in two different ways:

• [Formal Definition.] An irreducible Markov chain is called recurrent with regard to a given state A, which is a single point or a defined collection of points, if the probability that the chain occupies A infinitely often over unbounded time is nonzero.
• [Colloquial Definition.] When a chain moves into a recurrent state, it stays there forever and visits every subspace infinitely often.

There are also exactly two different, mutually exclusive, “flavors” of recurrence with regard to a state A:

• A Markov chain is positive recurrent if the mean time to return to A is bounded.
• Otherwise the mean time to return to A is infinite, and the Markov chain is called null recurrent.

With these we can also state the following properties.

Properties of Markov Chain Recurrence

Unions     If A and B are recurrent states, then A ∪ B is a recurrent state.
Capture    A chain that enters a closed, irreducible, and recurrent state stays there and visits every substate with probability one.
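The definitions of absorbing and transient states can be illustrated with a toy kernel. Everything in the sketch below is hypothetical and chosen only for illustration: the three-state kernel, its probabilities, and the names `absorbing_states` and `step` do not come from the text or from any study. States 0 and 2 are absorbing, and state 1 is transient because the chain leaves it with positive probability and can never come back.

```python
def absorbing_states(kernel):
    """Indices i with p(i, i) = 1, so a chain that enters i never leaves."""
    return [i for i, row in enumerate(kernel) if row[i] == 1]

def step(dist, kernel):
    """Apply the kernel once to a row vector of state probabilities."""
    n = len(kernel)
    return [sum(dist[i] * kernel[i][j] for i in range(n)) for j in range(n)]

# Hypothetical kernel (an assumption for illustration only).
K = [[1.0, 0.0, 0.0],   # state 0: absorbing
     [0.3, 0.5, 0.2],   # state 1: transient
     [0.0, 0.0, 1.0]]   # state 2: absorbing

print(absorbing_states(K))  # [0, 2]

# Start in the transient state and run the chain forward.
dist = [0.0, 1.0, 0.0]
for _ in range(60):
    dist = step(dist, K)
print([round(p, 4) for p in dist])  # [0.6, 0.0, 0.4]
```

All of the probability eventually drains out of the transient state and is captured by the two absorbing states, in proportion to the exit probabilities 0.3 and 0.2.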

Example 9.4: Conflict and Cooperation in Rural Andean Communities. Robbins and Robbins (1979) extended Whyte's (1975) study of 12 Peruvian communities by extrapolating future probabilities of conflict and cooperation using a Markov chain analysis. Whyte classified these communities in 1964 and 1969 as having one of four types of relations with the other communities: high cooperation and high conflict (HcHx), high cooperation and low conflict (HcLx), low cooperation and high conflict (LcHx), or low cooperation and low conflict (LcLx). The interesting question was what patterns emerged as these communities changed (or not) over the five-year period, since conflict and cooperation can exist simultaneously but not easily. The states of these communities at the two points in time are given in Table 9.1. So if we are willing to extrapolate these changes as Robbins and Robbins did by assuming that “present trends continue,” then a Markov chain transition matrix can be constructed from the empirically observed changes between 1964 and 1969. This is given by the following matrix, where the rows indicate 1964 starting points and the columns are 1969 outcomes:

The first thing we can notice is that HcHx and LcLx are both absorbing states as described above: Once the Markov chain reaches these states it never leaves. Clearly this means that the Markov chain is not irreducible because there are states that cannot “communicate.” Interestingly, there are two noncommunicating state spaces given by the 2 × 2 upper left and lower right submatrices. Intuitively it seems that any community that starts out as HcHx or HcLx ends up as HcHx (upper left), and any community that starts out as LcHx or LcLx ends up as LcLx (lower right). We can test this by running the Markov chain for some reasonable number of iterations and observing the limiting behavior. It turns out that it takes about 25 iterations (i.e., 25 five-year periods under the assumptions since the 0.75 value is quite persistent) for this limiting behavior to converge to the state:

but once it does, it never changes. This is called the stationary distribution of the Markov chain and is now formally deﬁned.

9.3.4 Stationarity and Ergodicity

In many applications a stochastic process eventually converges to a single limiting value and stays at that value permanently. It should be clear that a Markov chain cannot do that because it will by definition continue to move about the parameter space. Instead we are interested in the distribution that the Markov chain will eventually settle into. Actually, these chains do not have to converge in distribution, and some Markov chains will wander endlessly without pattern or prediction. Fortunately, we know some criteria that provide for Markov chain convergence.

First, define a marginal distribution of a Markov chain. For a Markov chain operating on a discrete state space, the marginal distribution of the chain at the mth step is obtained by inserting the current value of the chain, $\theta_i^{[m]}$, into the row of the transition kernel for the mth step, $p^m$:

$$p^{m}(\theta) = [p^{m}(\theta_1), p^{m}(\theta_2), \ldots, p^{m}(\theta_k)].$$


So the marginal distribution at the very first step of the discrete Markov chain is given by $\pi^1(\theta) = p^1 \pi^0(\theta)$, where $\pi^0$ is the initial starting value assigned to the chain and $p^1 = p$ is a transition matrix. The marginal distribution at some (possibly distant) step for a given starting value is

$$\pi^n = p\pi^{n-1} = p(p\pi^{n-2}) = p^2(p\pi^{n-3}) = \ldots = p^n\pi^0.$$

Since successive products of probabilities quickly result in lower probability values, the property above shows how Markov chains eventually “forget” their starting points.

Now we are prepared to define stationarity. Recall that $p(\theta_i, \theta_j)$ is the probability that the chain will move from $\theta_i$ to $\theta_j$ at some arbitrary step t, and $\pi^t(\theta)$ is the corresponding marginal distribution. Define $\pi(\theta)$ as the stationary distribution (a well-behaved probability function in the Kolmogorov sense) of the Markov chain for $\theta$ on the state space A, if it satisfies

$$\sum_{\theta_i} \pi^t(\theta_i)\,p(\theta_i, \theta_j) = \pi^{t+1}(\theta_j).$$
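The two properties just described, forgetting the starting point and remaining fixed under one more transition, can be illustrated with a minimal Python sketch. The two-state transition matrix `P`, the starting vectors, and the helper `marginal_after` are hypothetical choices made only for illustration; the marginal is stored as a row vector so that one step is `pi @ P`, in line with the stationarity condition above.

```python
import numpy as np

# Hypothetical two-state transition matrix (rows sum to one); not from the text.
P = np.array([[0.8, 0.2],
              [0.6, 0.4]])

def marginal_after(pi0, P, n):
    """Return the marginal distribution after n steps, starting from pi0."""
    pi = np.asarray(pi0, dtype=float)
    for _ in range(n):
        pi = pi @ P  # one step: pi^{t+1}(j) = sum_i pi^t(i) p(i, j)
    return pi

# Two very different (hypothetical) starting distributions.
start_a = [1.0, 0.0]
start_b = [0.0, 1.0]

pi_a = marginal_after(start_a, P, 50)
pi_b = marginal_after(start_b, P, 50)

# Both runs end at numerically the same distribution: the start is "forgotten".
print(pi_a, pi_b)

# Stationarity check: one more step leaves the distribution unchanged.
print(np.allclose(pi_a @ P, pi_a))
```

For this particular matrix both runs settle at (0.75, 0.25), which can be confirmed by hand from the two balance equations.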

The key point is that the marginal distribution remains fixed when the chain reaches the stationary distribution, and we might as well drop the superscript designation for iteration number and just use $\pi(\theta)$; in shorthand, $\pi = \pi p$. Once the chain reaches its stationary distribution, it stays in this distribution and moves about, or “mixes,” throughout the subspace according to the marginal distribution, $\pi(\theta)$, indefinitely. The key theorem is

An irreducible and aperiodic Markov chain will eventually converge to a stationary distribution, and this stationary distribution is unique.

Here the recurrence gives the range restriction property whereas stationarity gives the constancy of the probability structure that dictates movement. As you might have noticed by now in this chapter, Markov chain theory is full of new terminology. The type of chain just discussed is important enough


to warrant its own name: If a chain is recurrent and aperiodic, then we call it ergodic, and ergodic Markov chains with transition kernel K have the property

$$\lim_{n \to \infty} K^n(\theta_i, \theta_j) = \pi(\theta_j),$$

for all $\theta_i$ and $\theta_j$ in the subspace. What does this actually mean? Once an ergodic Markov chain reaches stationarity, the resulting values are all from the distribution $\pi(\theta_i)$. The Ergodic Theorem stated below is the equivalent of the strong law of large numbers but instead for Markov chains, since it states that any specified function of the posterior distribution can be estimated with samples from a Markov chain in its ergodic state because averages of sample values give strongly consistent parameter estimates.

The big deal about ergodicity and its relationship to stationarity comes from the important ergodic theorem. This essentially states that, given the right conditions, we can collect empirical evidence from the Markov chain values in lieu of analytical calculations. Specifically:

If $\theta_n$ is a positive recurrent, irreducible Markov chain with stationary distribution given by $\pi(\theta)$, then

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} f(\theta_i) = \sum_{\Theta} f(\theta)\,\pi(\theta).$$
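The limit above can be illustrated numerically. The following Python sketch simulates a small chain and compares the running average of a function of the state with the corresponding average under the stationary distribution. The three-state matrix `P`, the numeric state labels, the function `f`, and the number of steps are all hypothetical choices made for illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical three-state transition matrix (rows sum to one); not from the text.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
states = np.array([0.0, 1.0, 2.0])  # numeric labels attached to the states

def f(theta):
    """An arbitrary function of the state, chosen only for illustration."""
    return theta ** 2

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()

# Simulate the chain by inverting each row's cumulative probabilities.
n_steps = 100_000
cum = np.cumsum(P, axis=1)
cum[:, -1] = 1.0            # guard against floating-point shortfall
u = rng.random(n_steps)

current = 0
total = 0.0
for t in range(n_steps):
    current = int(np.searchsorted(cum[current], u[t], side="right"))
    total += f(states[current])

empirical_average = total / n_steps               # (1/n) * sum of f(theta_i)
stationary_average = float(np.sum(f(states) * pi))  # sum of f(theta) pi(theta)
print(empirical_average, stationary_average)
```

The two printed numbers should be close but not identical, since the empirical average is a simulation estimate whose error shrinks as the number of steps grows.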

Specifically, this means that empirical averages for the function f() converge to probabilistic averages. This is the justification for using Markov chains to approximate difficult analytical quantities, thus replacing human effort with automated effort (at least to some degree!).

Example 9.5: Population Migration Within Malawi. Discrete Markov chains are enormously useful in describing movements of populations, and demographers often use them in this way. As an example, Segal (1985) looked at population movements between Malawi’s three administrative regions from 1976 to 1977. The Republic of Malawi is a narrow, extended southern African country of 45,745 square miles wrapped around the eastern and southern parts of Lake Malawi. Segal took observed migration numbers


to create a transition matrix for future movements under the assumption of stability. This is given by

$$
\begin{array}{l|ccc}
\text{Source} \backslash \text{Destination} & \text{Northern} & \text{Central} & \text{Southern} \\ \hline
\text{Northern} & 0.970 & 0.019 & 0.012 \\
\text{Central} & 0.005 & 0.983 & 0.012 \\
\text{Southern} & 0.004 & 0.014 & 0.982
\end{array}
$$

It is important to note substantively that using this transition matrix to make future predictions about migration patterns ignores the possibility of major shocks to the system such as pandemics, prolonged droughts, and political upheaval. Nonetheless, it is interesting, and sometimes important, to anticipate population changes and the subsequent national policy issues.

The obvious question is whether the transition matrix above defines an ergodic Markov chain. Since this is a discrete transition kernel, we need to assert that it is recurrent and aperiodic. This is a particularly simple example because recurrence comes from the lack of zero probability values in the matrix. Although the presence of zero probability values alone would not be proof of nonrecurrence, the lack of any shows that all states communicate with nonzero probabilities and thus recurrence is obvious. Note that there is also no mechanism to impose a cycling effect through the cell values, so aperiodicity is also apparent. Therefore this transition kernel defines an ergodic Markov chain that must then have a unique stationary distribution. While there is no proof of stationarity, long periods of unchanging marginal probabilities are typically a good sign, especially with such a simple and well-behaved case. The resulting stationary distribution after multiplying the transition kernel 600 times is


$$
\begin{array}{lccc}
\text{Destination:} & \text{Northern} & \text{Central} & \text{Southern} \\ \hline
 & 0.1315539 & 0.4728313 & 0.3956149
\end{array}
$$

We can actually run the Markov chain much longer without any real trouble, but the resulting stationary distribution will remain unchanged from this result. This is a really interesting ﬁnding, though. Looking at the original transition matrix, there is a strong inclination to stay in the same region of Malawi for each of the three regions (the smallest has probability 0.97), yet in the limiting distribution there is a markedly different result, with migration to the Central Region being almost 50%. Perhaps more surprisingly, even though there is a 0.97 probability of remaining in the Northern Region for those starting there on any given cycle, the long-run probability of remaining in the Northern Region is only 0.13.
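The calculation in this example can be sketched in a few lines of Python. One assumption is needed: the first row of the matrix as printed sums to 1.001, presumably because of rounding, so the sketch rescales every row to sum to one before powering the matrix. That rescaling is a choice made here, not something stated in the example, and it means the output agrees with the reported stationary distribution only approximately.

```python
import numpy as np

# Transition matrix from Example 9.5 (rows: source region; columns: destination).
# Region order: Northern, Central, Southern.
P = np.array([[0.970, 0.019, 0.012],
              [0.005, 0.983, 0.012],
              [0.004, 0.014, 0.982]])

# Assumption: the first printed row sums to 1.001, so rescale rows to sum to one.
P = P / P.sum(axis=1, keepdims=True)

# Multiply the kernel by itself 600 times, as in the example.
P600 = np.linalg.matrix_power(P, 600)

# Every row of the powered matrix is numerically the same distribution,
# so the starting region no longer matters.
for region, row in zip(["Northern", "Central", "Southern"], P600):
    print(region, np.round(row, 4))

# Any row can serve as the stationary distribution; check pi = pi P.
stationary = P600[0]
print(np.allclose(stationary @ P, stationary))
```

Raising the power beyond 600 leaves the rows unchanged to the displayed precision, which is the numerical counterpart of the statement that the chain has reached its stationary distribution.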

9.3.5 Reversibility

Some Markov chains are reversible in that they perform the same run backward as forward. More specifically, if $p(\theta_i, \theta_j)$ is a single probability from a transition kernel K and $\pi(\theta)$ is a marginal distribution, then the Markov chain is reversible if it meets the condition

$$p(\theta_i, \theta_j)\,\pi(\theta_i) = p(\theta_j, \theta_i)\,\pi(\theta_j).$$

This expression is called both the reversibility condition and the detailed balance equation. What this means is that the distribution of $\theta$ at time t + 1 conditioned on the value of $\theta$ at time t is the same as the distribution of $\theta$ at time t conditioned on the value of $\theta$ at time t + 1. Thus, for a reversible Markov chain the direction of time is irrelevant to its probability structure.
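As a supplementary sketch (not part of the original argument), a short derivation shows why the detailed balance equation is useful in practice. Summing both sides over $\theta_i$, and using only the fact that each row of the transition kernel sums to one, gives:

```latex
\sum_{\theta_i} \pi(\theta_i)\, p(\theta_i, \theta_j)
  = \sum_{\theta_i} \pi(\theta_j)\, p(\theta_j, \theta_i)
  = \pi(\theta_j) \sum_{\theta_i} p(\theta_j, \theta_i)
  = \pi(\theta_j).
```

So any distribution that satisfies detailed balance also satisfies the stationarity condition $\pi = \pi p$. The converse need not hold: a chain can have a stationary distribution without being reversible.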


As an example of reversibility, we modify a previous example where the probability of transitioning between adjacent states for a four-state system is determined by ﬂipping a fair coin (states 1 and 4 are assumed adjacent to complete the system):

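$$
\overset{p=\frac{1}{2}}{\Longleftarrow}\;\theta_1\;\overset{p=\frac{1}{2}}{\Longleftrightarrow}\;\theta_2\;\overset{p=\frac{1}{2}}{\Longleftrightarrow}\;\theta_3\;\overset{p=\frac{1}{2}}{\Longleftrightarrow}\;\theta_4\;\overset{p=\frac{1}{2}}{\Longrightarrow}
$$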

It should be clear that the stationary distribution of this system is uniform across the four events, and that it is guaranteed to reach it since it is recurrent and aperiodic. Suppose we modify the transition rule to be asymmetric from every point, according to

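$$
\overset{p=\frac{9}{10}}{\Longleftarrow}\;\theta_1\;\underset{p=\frac{9}{10}}{\overset{p=\frac{1}{10}}{\rightleftarrows}}\;\theta_2\;\underset{p=\frac{9}{10}}{\overset{p=\frac{1}{10}}{\rightleftarrows}}\;\theta_3\;\underset{p=\frac{9}{10}}{\overset{p=\frac{1}{10}}{\rightleftarrows}}\;\theta_4\;\overset{p=\frac{1}{10}}{\Longrightarrow}
$$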

So what we have now is a chain that strongly prefers to move left at every step by the same probability. This Markov chain will also lead to a uniform stationary distribution, because it is clearly still recurrent and aperiodic. It is, however, clearly not reversible anymore because for adjacent $\theta$ values (i.e., those with nonzero transition probabilities)

$$p(\theta_i, \theta_j)\,\pi(\theta_i) \neq p(\theta_j, \theta_i)\,\pi(\theta_j), \quad i < j$$

$$\frac{1}{10} \cdot \frac{1}{4} \neq \frac{9}{10} \cdot \frac{1}{4}$$

(where we say that 4 < 1 by assumption to complete the system).
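The comparison between the two four-state systems can be checked numerically. The Python sketch below builds both transition matrices and tests two things only: whether the uniform distribution satisfies $\pi = \pi p$, and whether detailed balance holds. It does not examine convergence. The helper names `ring_chain`, `is_stationary`, and `is_reversible` are hypothetical, introduced here for illustration.

```python
import numpy as np

def ring_chain(p_left, p_right, k=4):
    """Transition matrix for k states on a ring: move one state left or right."""
    P = np.zeros((k, k))
    for i in range(k):
        P[i, (i - 1) % k] = p_left    # the first state wraps around to the last
        P[i, (i + 1) % k] = p_right   # the last state wraps around to the first
    return P

def is_stationary(pi, P):
    """Check the stationarity condition pi = pi P."""
    return np.allclose(pi @ P, pi)

def is_reversible(pi, P):
    """Check detailed balance: pi(i) p(i, j) == pi(j) p(j, i) for all i, j."""
    flow = pi[:, None] * P            # flow[i, j] = pi(i) p(i, j)
    return np.allclose(flow, flow.T)

uniform = np.full(4, 0.25)
fair = ring_chain(0.5, 0.5)      # fair-coin version
biased = ring_chain(0.9, 0.1)    # strongly prefers moving left

print(is_stationary(uniform, fair), is_reversible(uniform, fair))      # True True
print(is_stationary(uniform, biased), is_reversible(uniform, biased))  # True False
```

In both cases the uniform distribution passes the stationarity check, but only the fair-coin chain passes the detailed balance check, matching the inequality shown above.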

9.1 Consider a lone knight on a chessboard making moves uniformly randomly from those legally available to it at each step. Show that the path of the knight (starting in a corner) is or is not a Markov chain. If so, is it irreducible and aperiodic?

9.2 For the following matrix, fill in the missing values that make it a valid transition matrix:

[4 × 4 matrix with blank cells; the entries that survive in the source are, apparently by row: 0.1, 0.2, 0.3; 0.9, 0.01, 0.01; 0.0, 0.0, 0.0; 0.2, 0.2, 0.2. The positions of the blank cells are not recoverable.]

9.3 Using this matrix:

$$X = \begin{bmatrix} \frac{1}{4} & \frac{3}{4} \\ \frac{1}{2} & \frac{1}{2} \end{bmatrix},$$

find the vector of stationary probabilities.

9.4 Consider a discrete state space with only two events: 0 and 1. A stochastic process operates on this space with probability of transition one-half for moving or staying in place. Show that this is or is not a Markov chain.

9.5 There are many applications and famous problems for Markov chains that are related to gambling. For example, suppose a gambler bets 1 on successive games unless she has won three in a row. In the latter case she bets 3 but returns to 1 if this game is lost. Does this dependency on more than the last value ruin the Markovian property? Can this process be made to depend only on a previous “event”?

9.6 One urn contains 10 black marbles and another contains 5 white marbles. At each iteration of an iterative process 1 marble is picked from each urn and swapped with probability p = 0.5 or returned to its original urn with probability 1 − p = 0.5. Give the transition matrix


for the process and show that it is Markovian. What is the limiting distribution of marbles?

9.7 Consider the prototypical example of a Markov chain kernel:

$$\begin{bmatrix} p & 1-p \\ 1-q & q \end{bmatrix},$$

(a) For the starting point [0.6, 0.4] calculate the first 10 chain values.
(b) For the starting point [0.1, 0.9] calculate the first 10 chain values.
(c) Does this transition matrix define an ergodic Markov chain?
(d) What is the limiting distribution, if it exists?

9.9 Suppose that for a Congressional race the probability that candidate B airs negative campaign advertisements in the next period, given that candidate A has in the current period, is 0.7; otherwise it is only 0.07. The same probabilities apply in the opposite direction. Answer the following questions.
(a) Provide the transition matrix.
(b) If candidate B airs negative ads in period 1, what is the probability that candidate A airs negative ads in period 3?
(c) What is the limiting distribution?

9.10 Duncan and Siverson (1975) used the example of social mobility where grandfather’s occupational class affects the grandson’s occupational

where the grandfather’s occupational class defines the rows and the grandson’s occupational class defines the columns. What is the long-run probability of no social mobility for the three classes? Their actual application is to Sino-Indian relations from 1959 to 1964, where nine communication states are defined by categorizing each country’s weekly communication with the other as high (3 or more), medium (1 or 2), or low (zero), at some point in time:

This setup nicely fits the Markovian assumption since interstate communication is inherently conditional. Their (estimated) transition matrix, expressed as percentages instead of probabilities, is

[9 × 9 transition matrix P, with rows and columns indexed by the communication states 1 through 9 and entries given as percentages; the cell positions are not recoverable from the source.]

What evidence is there for the claim that there is “a tendency for China to back down from high levels of communication and a certain lack of responsiveness for India”? What other indications of “responsiveness” are found here and what can you conclude about the long-run behavior of the Markov chain?

9.11 Give an example transition matrix that produces a nonirreducible Markov chain, and show that it has at least two distinct limiting distributions.

9.12 For the following matrices, find the limiting distribution:

[three matrices; the only entries that survive in the source are one row, 0.75, 0.25, 0.0, of the third matrix.]

Chung (1969) used Markov chains to analyze hierarchies of human needs in the same way that Maslow famously considered them. The key point is that needs in a specific time period are conditional on the needs in the previous period and thus are dynamic, moving up and down from basic to advanced human needs. Obviously the embedded assumption is that the pattern of needs is independent of previous periods conditional on the last period. In a hypothetical example Chung constructed a proportional composition of needs for some person according to

$$N = (N_{ph}, N_{sf}, N_{so}, N_{sr}, N_{sa}) = (0.15, 0.30, 0.20, 0.25, 0.10),$$

which are supposed to reflect transition probabilities from one state to another where the states are from Maslow’s hierarchy: physiological, safety, socialization, self-respect, and self-actualization. Furthermore, suppose that as changes in socio-economic status occur, the composition of needs changes probabilistically according to

[transition matrix with columns $N_{ph}$, $N_{sf}$, $N_{so}$, $N_{sr}$, $N_{sa}$; the entries are missing from the source.]

Verify his claim that the system of needs reaches a stationary distribution after four periods using the starting point defined by N. How does this model change Maslow’s assumption of strictly ascending needs?

9.15 Consider the following Markov chain from Atchadé and Rosenthal (2005). For the discrete state space $\Theta = \{1, 3, 4\}$, at the nth step produce the (n + 1)st value by:
• if the last move was a rejection, generate $\theta \sim \text{uniform}(\theta_n - 1 : \theta_n + 1)$;
• if the last move was an acceptance, generate $\theta \sim \text{uniform}(\theta_n - 2 : \theta_n + 2)$;
• if $\theta \in \Theta$, accept $\theta$ as $\theta_{n+1}$, otherwise reject and set $\theta_n$ as $\theta_{n+1}$;
where these uniform distributions are on the inclusive positive integers and some arbitrary starting point $\theta_0$ (with no previous acceptance) is assumed. What happens to this chain in the long run?

9.16 Markov chain analysis can be useful in game theory. Molander (1985) constructed the following matrix in his look at “tit-for-tat” strategies in international relations:

[4 × 4 matrix whose first row is $(1-p)^2,\; p(1-p),\; p(1-p),\; p^2$; the remaining rows are missing from the source.]

There are four outcomes where the rows indicate action by player C and the columns indicate a probabilistic response by player D for some stable probability p. Show that this is a valid transition matrix and ﬁnd the vector of stationary probabilities. Molander modiﬁed this game to allow players the option of “generosity,” which escapes the

cycle of vendetta. This 4 × 4 transition matrix is given by

where c is the probability that player C deviates from tit-for-tat and d is the probability that player D deviates from tit-for-tat. Show that this matrix defines a recurrent Markov chain and derive the stationary distribution for $p = c = d = \frac{1}{2}$.

9.17 Dobson and Meeter (1974) modeled the movement of party identification in the United States between the two major parties as a Markovian process. The following transition matrix gives the probabilities of not moving from one status to another conditional on moving (hence the zeros on the diagonal):