
Political science research frequently models binary or ordered outcomes involving related processes. However, traditional modeling of these outcomes ignores common data issues and cannot capture nuances. There is often an excess of zeros, the observed outcomes for different actors are inherently related, and competing actors may respond to the same factors differently. This paper extends existing models and develops a zero-inflated multivariate ordered probit to simultaneously address these issues. This model performs better than existing models at capturing the true parameters of interest, estimates the nature of the related processes, and captures the differences in actors’ decision-making. I demonstrate these benefits through simulation exercises and an application to party behavior in Mexico.

How does wartime exposure to ethnic violence affect the political preferences of ordinary citizens? Are high-violence communities more or less likely to reject the politicization of ethnicity post-war? We argue that community-level experience with wartime violence solidifies ethnic identities, fosters intra-ethnic cohesion and increases distrust toward non-co-ethnics, thereby making ethnic parties the most attractive channels of representation and contributing to the politicization of ethnicity. Employing data on wartime casualties at the community level and pre- as well as post-war election results in Bosnia, we find strong support for this argument. The findings hold across a number of robustness checks. Using post-war survey data, we also provide evidence that offers suggestive support for the proposed causal mechanism.

Scholars are increasingly utilizing online workforces to encode latent political concepts embedded in written or spoken records. In this letter, we build on past efforts by developing and validating a crowdsourced pairwise comparison framework for encoding political texts that combines the human ability to understand natural language with the ability of computers to aggregate data into reliable measures while ameliorating concerns about the biases and unreliability of non-expert human coders. We validate the method with advertisements for U.S. Senate candidates and with State Department reports on human rights. The framework we present is very general, and we provide free software to help applied researchers interact easily with online workforces to extract meaningful measures from texts.

Quantitative Methods in Archaeology Using R is the first hands-on guide to using the R statistical computing system written specifically for archaeologists. It shows how to use the system to analyze many types of archaeological data. Part I includes tutorials on R, with applications to real archaeological data showing how to compute descriptive statistics, create tables, and produce a wide variety of charts and graphs. Part II addresses the major multivariate approaches used by archaeologists, including multiple regression (and the generalized linear model); multivariate analysis of variance and discriminant analysis; principal components analysis; correspondence analysis; distances and scaling; and cluster analysis. Part III covers specialized topics in archaeology, including intra-site spatial analysis, seriation, and assemblage diversity.

We have discussed the concepts of multivariate spaces and distances in earlier chapters. With discriminant analysis, we created a space that separated groups of observations. With principal components, we defined a multivariate space for the observations in fewer dimensions while losing the least amount of information. With correspondence analysis, we displayed observations and variables in terms of their Chi-square distances. This chapter is the first of two that focus on quantitative methods that start with a distance matrix and represent the distances between observations either in the form of a map (this chapter) or by grouping observations that are similar (Chapter 15). In both cases we generally start by computing a distance or similarity measure between pairs of observations.

The first part of the chapter describes different ways of defining distance and similarity. Different choices in the measurement of distance can have substantial influence on the results. The second part describes ways of analyzing distance matrices in order to represent them in the form of a map. If you have ever looked at a highway map, you may have noticed a triangular table of distances between major cities. What if you only had the table of distances? How would you go about reconstructing the map? Scaling methods were primarily developed in the field of psychology where data expressing perception or preferences are gathered directly in the form of a similarity matrix that reflects judgments regarding how similar pairs of stimuli are. In archaeology, the classic application is seriation, which attempts to represent variation in assemblages along a single dimension that may represent time (Chapter 17). The third part of the chapter illustrates how to compare two distance matrices. For example, if we have sites located in a region and collections of ceramics from those sites, how do we compare the geographic distances to ceramic assemblage distances?
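The map-reconstruction question above can be sketched in a few lines of base R: classical multidimensional scaling, cmdscale(), recovers a configuration of points from nothing but their distance matrix. The four "sites" and their coordinates below are hypothetical, invented only for illustration.

```r
## Recover a map from a distance matrix with classical multidimensional
## scaling (base R). The site coordinates are hypothetical.
pts <- data.frame(x = c(0, 3, 0, 4), y = c(0, 0, 4, 3),
                  row.names = c("SiteA", "SiteB", "SiteC", "SiteD"))
d <- dist(pts)                 # Euclidean distances between the four sites
coords <- cmdscale(d, k = 2)   # reconstruct 2-d coordinates from d alone
## The recovered configuration matches the original only up to rotation,
## reflection, and translation, but the inter-point distances agree:
all.equal(as.vector(dist(coords)), as.vector(d))
```

cmdscale() gives the metric solution; for nonmetric (rank-order) scaling, isoMDS() in the recommended MASS package is the usual alternative.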

DISTANCE, DISSIMILARITY, AND SIMILARITY

The term distance refers to a numeric score that indicates how close or far two observations are in terms of a set of variables. Larger distances mean the observations are less similar to one another. Dissimilarity measures are larger when two objects are more distant or different from one another. Similarity measures are larger when two objects are more like one another.

A table is simply a two-dimensional presentation of data or a summary of the data. We use tables to inspect the original data for errors or problems such as missing entries. We used tables to present condensed summaries of data values in Chapter 3 (e.g., numSummary()). Those summaries involved computing summary statistics by a categorical variable to see how the groups differed from one another. We can also use tables to see how categorical variables covary.

Nominal or categorical data play a large role in archaeological research. At the regional level, sites are the categories and we are interested in the number of different types of artifacts (also a category) found in each site. The same applies at the site level where the artifact categories are distributed across excavation units. Within sites, different kinds of features are present and features contain different types of artifacts. At the artifact level, some properties of artifacts are represented by categories. Because of this, the same data are often represented in different ways for different purposes. That is not a problem unless the statistical procedures we are using expect a format different from the one we are currently using. In Chapter 3, we created tables of descriptive statistics. In this chapter we are concerned with tables in which the cell entries consist of counts of objects.

R distinguishes between tables and data frames and some functions will work with one but not the other. Data frames have columns that represent different types of data (e.g., character strings, factors, numbers), but tables in R represent numeric data only. In fact, R tables are a kind of matrix. Before constructing tables, we will briefly describe how R encodes categorical data using factors.
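The distinction can be seen directly: cross-tabulating two columns of a data frame with table() yields a purely numeric, matrix-like object. The artifact data here are hypothetical.

```r
## A data frame mixes column types; a table produced from it is numeric
## and matrix-like. The artifact data are hypothetical.
artifacts <- data.frame(site = c("A", "A", "B", "B", "B"),
                        type = c("flake", "core", "flake", "flake", "core"))
tab <- table(artifacts$site, artifacts$type)
tab                       # counts of each artifact type at each site
is.data.frame(artifacts)  # TRUE
is.matrix(tab)            # TRUE: a 2-d table is stored like a matrix
```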

FACTORS IN R

Factors are a way of storing categorical information in R. If you have coded a variable into a set of categories, you can store the information as a character vector or as a factor. A factor stores each category as an integer, and the category labels are stored as levels. In versions of R before 4.0.0, importing data into a data frame automatically converted character vectors into factors unless you used the argument stringsAsFactors=FALSE; since R 4.0.0 the default is stringsAsFactors=FALSE, so character vectors stay character unless you request the conversion.
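A minimal sketch, using a hypothetical vector of sherd types:

```r
## Factors store categories as integer codes plus a set of level labels.
ware <- c("plain", "corded", "plain", "painted", "corded")
fw <- factor(ware)
levels(fw)       # "corded" "painted" "plain" (alphabetical by default)
as.integer(fw)   # 3 1 3 2 1 -- the underlying integer codes
table(fw)        # counts per category
## Since R 4.0.0, data.frame() keeps character columns as character
## unless stringsAsFactors = TRUE is given:
dat <- data.frame(ware = ware, stringsAsFactors = TRUE)
class(dat$ware)  # "factor"
```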

Archaeological data come in all sizes, shapes, and quantities, ranging from Egyptian pyramids (large in size, small in the number of specimens) to micro-debitage from a lithic workshop or molecular residues in a ceramic bowl. Because the questions we ask of the data are different, our representations of those data differ. One way of representing the data dominates, however, because it is so flexible. That is a rectangular arrangement of data so that each row represents an observation and each column represents a measurement on that observation. Some of those measurements can be counts, and each count is a potential observation for another data table.

For example, we may have located a variety of archaeological sites in a river valley. One data table could consist of the grid units that were surveyed so that each row of the table is a grid square (e.g., 100 m on a side). The columns of the data set include the coordinates of the unit and the number of sites and isolated artifact finds discovered during the survey. There could be other columns identifying when the unit was surveyed and information about the location of the unit with respect to topographic features such as dominant soil type, major waterways, lakes, and so on. This data set would be relevant to exploring questions about site density. For example, are there more sites near water features and fewer in upland areas away from any water source?

Each of the counts in this data set is a potential row in another data set. That data set consists of a row for each site and columns for the location of the site, the area of the site, the physical characteristics around the site (e.g., slope, elevation, aspect, soil type), and the number of different kinds of artifacts and features found on the site. This data set would be relevant to questions regarding where sites are located and how the artifacts and features found on sites differ.

Each of the artifacts and features in the site data set is a potential row in another data set (or more likely multiple data sets). At this point it may make sense to create separate data sets for projectile points, flakes, cores, pottery sherds, shells, bones, and other categories of material.

Correspondence analysis provides a way to summarize categorical data in a reduced number of dimensions (Clausen, 1998; Greenacre, 2007). In that sense, it is very similar to principal components analysis. Principal components is an asymmetrical analysis. We use the correlations (or covariances) between the variables as a summary of the structure in the data. The principal components represent a way of describing the correlation matrix in fewer components than variables. The analysis is asymmetrical because we focus on the relationships between variables and use the principal components to compute scores for each of the observations in the new, reduced space.

In correspondence analysis, the data usually consist of counts of different kinds of things. They could be different artifact types from a variety of sites, strata, or features or they could be different elements in the composition of artifacts. Correspondence analysis is a symmetrical analysis because we adjust the data matrix by both the rows (observations) and the columns (variables) before conducting the analysis. As a result, we can project the observations into the space defined by the variables (as with principal components) or the variables into the space defined by the observations. We can also create biplots summarizing both views.

The adjustment of the data matrix is simply a modification of the Chi-square test that we covered in Chapter 9. In the Chi-square test we compute an expected value for a particular cell by multiplying the row sum by the column sum and dividing by the total sum. The difference between the observed and expected values is squared and divided by the expected value to get the Chi-square contribution for that cell. The sum of all the Chi-square contributions is the total Chi-square value that we use to see if the observed counts are significantly different from what we would expect by chance.

To perform a correspondence analysis, we modify that procedure slightly. First, we divide every value in the table by the sum of all the entries so that each cell represents the proportion of the total found in that cell. Then we compute the expected proportions using the row and column sums of the table of proportions.
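The two steps just described can be checked numerically on a small hypothetical count table; multiplying the summed squared deviations back up by the grand total reproduces the ordinary Chi-square statistic.

```r
## Correspondence-analysis preprocessing on a hypothetical count table.
counts <- matrix(c(20, 10, 5,
                   15, 30, 20), nrow = 2, byrow = TRUE,
                 dimnames = list(c("Site1", "Site2"),
                                 c("flakes", "cores", "sherds")))
P <- counts / sum(counts)                  # cell proportions of the grand total
expected <- outer(rowSums(P), colSums(P))  # expected proportions under independence
## Scaling back up recovers the usual Chi-square statistic:
chisq <- sum(counts) * sum((P - expected)^2 / expected)
all.equal(chisq, unname(chisq.test(counts)$statistic))
```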

Archaeological assemblages are collections of artifacts that have been assigned to different groups. The boundaries defining an assemblage can be a whole site, the part of the site excavated, a feature within a site (e.g., pit, grave, house), or an arbitrary unit defined in terms of horizontal and vertical space (Level 7 of unit N302E200). A description of an assemblage includes how its boundaries are defined and how many of each kind of archaeological material was present within those boundaries.

One of the challenges in analyzing archaeological assemblages is that the factors that control the counts are usually not controlled by the archaeologist. The boundaries of the assemblage usually do not represent a consistent amount of time from one assemblage to another. It does not matter if time is measured in years or person-years, we cannot assume that the amount of time is the same between assemblages except in rare circumstances such as graves and shipwrecks. Artifact composition, whether the result of a geological event (obsidian) or a behavioral event (ceramics), does not include the same level of uncertainty. Artifact composition is expressed using some similar measure that standardizes abundance (e.g., percent, per mil, ppm, ppb). Artifact assemblages may be similarly standardized in terms of percentage of the whole assemblage or only those items under analysis (e.g., ceramics, but not lithics; faunal material, but not botanical material; lithic artifacts, but not lithic debitage) or in terms of density (items per volume), but usually we are not certain that volume means the same thing across the various assemblages under consideration.

This means the analysis of artifact assemblages is similar to, but different from, the analysis of artifact composition and of species composition in ecological communities. While it makes sense to borrow from the approaches used by both, it is important to recognize the differences. In comparison with ecological communities, artifact types are less clear-cut than species. The assemblage represents an accumulation of discarded material rather than the observation of living individuals present at a particular point in time, and in that sense archaeological assemblages are more similar to fossil communities. Instead of the niches occupied by biological species, artifacts occupy space defined by human interaction with the physical environment and with social networks.

Raw data come in many sizes and shapes, and occasionally they are the wrong sizes and shapes for what we want to do with them. In those situations, it can be useful to transform them before analysis. Transforming data is often useful to make a skewed distribution more symmetric or to pull in outlying observations to reduce their influence in the analysis. Transformations can be applied down columns (e.g., standard scores to weight each variable equally) or across rows (e.g., percentages to weight each assemblage equally). In general, there are four data problems that can sometimes be resolved with transformations.

First, transformations can help to produce a distribution that is closer to a normal distribution, making it possible to use parametric statistical methods (such as t-tests). In this case, we are looking at the raw data distribution and using an order-preserving transformation that makes the data more symmetrical. The alternative to transforming the data is to use nonparametric tests that do not require a normal distribution or robust statistical methods that are not as influenced by extremely large or small values.

Second, transformations can make it possible to use simple linear regression to fit nonlinear relationships between two variables. Transforming one or both variables makes the relationship between them linear. The drawback with this approach is that the errors are transformed as well so that additive errors become multiplicative errors when using a log transform. The alternative to transformation is to use nonlinear regression.

Third, transformations can be used to weight variables equally so that differences in measurement scales or variance do not give some variables more influence than others in the analysis. This is particularly important when we are using the concept of “distance” between observations (Chapter 14).

Fourth, transformations can be used to control for size differences between assemblages or specimens that we want to exclude from the analysis in order to focus on shape or relationships between variables that are independent of differences in size. In this case the transformation is applied to the rows of the data. First, we will consider a collection of R functions that are useful for a number of purposes, including transformation.
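Each of the four uses can be sketched with one or two base-R idioms; the data here are hypothetical.

```r
## (1) Symmetrize a skewed variable and pull in an outlier:
x <- c(1, 2, 3, 5, 8, 13, 120)
log10(x)
## (2) Linearize: taking logs of both variables turns the power law
##     y = a * x^b into the straight line log(y) = log(a) + b * log(x).
## (3) Weight variables equally (down columns) with standard scores:
z <- scale(c(10, 20, 30, 40))     # mean 0, standard deviation 1
## (4) Remove size differences (across rows) with row percentages:
counts <- matrix(c(20, 10, 5, 15, 30, 20), nrow = 2, byrow = TRUE)
pct <- prop.table(counts, margin = 1) * 100
rowSums(pct)                      # each assemblage now sums to 100
```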

Archaeology is the study of human culture and behavior through its material evidence. Although archaeology sometimes works with the material evidence of contemporary societies (ethnoarchaeology) or historical societies (historical archaeology and classical archaeology), for most of our past, the archaeological record is the only source of information. What we can learn about that past must come from surviving artifacts and modifications of the earth's surface produced by human activity. Fortunately, people tend to be messy.

Our basic sources of evidence consist of artifacts, waste products produced during the manufacture of artifacts or their use, food waste, ground disturbances including pits and mounds, constructions that enclose spaces such as buildings and walls, and the physical remains of people themselves. Study of this evidence includes identification of the raw materials used, what modifications occurred to make the object useful, and the physical shape and dimensions of the final product. Wear and breakage of the object and its repair are also examined.

In addition to its life history, each object has a context. It was discovered in a particular part of a site, in a particular site in a region, occupied by humans at a particular time. Together these make up the three dimensions that Albert Spaulding referred to as the “dimensions of archaeology” (Spaulding 1960).

Our discovery and analysis of archaeological evidence is directed toward the broad goal of understanding our past. The range of questions archaeologists are attempting to answer about the past is substantial. Broadly they could be grouped into a number of big questions:

1. How did our ancestors come to develop a radically new way of living that involved changes in locomotion (bipedalism), increasing use of tools, the formation of social groups unlike any other living primate, and increases in cranial capacity? Quantitative methods are used to identify sources of raw material for stone tools to determine how far they were transported. They are also used to classify stone tools, to compare the kinds of tools and the kinds of animals found at different sites, and to look for correlations between the distributions of stone tools and animal bones.

In Chapter 10, we expanded on linear regression by using more than one explanatory variable on the right-hand side of the formula. In this chapter, we will expand on t-tests and analysis of variance from Chapter 8 by adding more than one response variable on the left-hand side of the formula. Hotelling's T² test is a multivariate expansion of the t-test, and multivariate analysis of variance (MANOVA) is a multivariate expansion of analysis of variance. In many cases, we have multiple measures of artifact shape or composition, and running t-tests separately on each variable creates multiple comparisons problems. Also, the tests are not really independent if the variables are correlated with one another, as they often are. Hotelling's T² and MANOVA provide an overall test of the difference between the groups based on all of the numeric variables. The tests of significance are on linear combinations of the response variables rather than on the original separate variables.
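A minimal sketch using the built-in iris data (not an archaeological data set, but the mechanics are identical): stats::manova() fits one model for two response variables at once, and with only two groups the same machinery gives Hotelling's T².

```r
## One overall multivariate test instead of separate t-tests per variable.
fit <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(fit, test = "Pillai")   # test on linear combinations of the responses
```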

Discriminant analysis involves a similar process in that we are looking for linear combinations of variables that allow us to predict a categorical variable. The most common archaeological application is in compositional analysis where we are trying to characterize different sources (geological sources or manufacturing sources) on the basis of molecular or elemental composition. Discriminant analysis includes two separate but related analyses. One is the description of differences between groups (descriptive discriminant analysis) and the second involves predicting to what group an observation belongs (predictive discriminant analysis, Huberty and Olejnik 2006).

Descriptive discriminant analysis is based on multivariate analysis of variance. Instead of a single numeric dependent (response) variable, we have several variables. To test for differences between groups, we compute linear combinations of the original variables and then test for significant differences between the linear combinations. A linear combination is like a multiple regression equation in the sense that each variable is multiplied by a value and summed to produce a new value that summarizes variability in the original variables. Descriptive discriminant analysis is also described as canonical discriminant analysis, and the linear combinations are referred to as canonical variates. The method is used to visualize the similarities and differences between groups in two or three dimensions.
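Both sides of the method can be sketched with lda() from the recommended MASS package (shipped with every R installation), again using the built-in iris data as a stand-in for a compositional data set.

```r
library(MASS)   # provides lda()
fit <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = iris)
fit$scaling               # descriptive side: coefficients of LD1 and LD2
pred <- predict(fit)      # predictive side: assign each case to a group
mean(pred$class == iris$Species)  # resubstitution accuracy
plot(fit)                 # groups displayed on the canonical variates
```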