What is a codebook?

A codebook provides information on the structure, contents, and layout of a data file. Users are strongly encouraged to look at the codebook of a study before downloading the data files.
While codebooks vary widely in quality and amount of information given, a typical codebook includes:

Column locations and widths for each variable

Definitions of different record types

Response codes for each variable

Codes used to indicate nonresponse and missing data

Exact questions and skip patterns used in a survey

Other indications of the content and characteristics of each variable

Codebooks may also contain:

Frequencies of response

Survey objectives

Concept definitions

A description of the survey design and methodology

A copy of the survey questionnaire (if applicable)

Information on data collection, data processing, and data quality

The ICPSR Guide to Codebooks discusses the components of a codebook in detail and shows examples of variable-level details from a wide variety of research codebooks.

The following example from ICPSR 9721 (Descriptors and Measurements of the Height of Runaway Slaves and Indentured Servants in the United States, 1700-1850) illustrates the main components of a typical ICPSR codebook:

The body of a codebook describes the content of the data file and generally includes the following elements for each variable in the data file:

Variable Name: Indicates the variable number or name assigned to each variable in the data collection.

Variable Column Location: Indicates the starting location and width of a variable. If the variable is a multiple-response type, the width referenced is that of a single response.

Variable Label: Indicates an abbreviated variable description (maximum of 40 characters) to identify the variable for the user. In some cases, an expanded version of the variable name can be found in a variable description list.

Missing Data Code: Indicates the values and labels of missing data. If 9 is a missing value, then the codebook could note (MD=9). Alternative statements for other variables are "MD=8 OR GE 9" or "NO MISSING DATA CODES." Some analysis software packages require that certain types of data that the user wants excluded from analysis be designated as "MISSING DATA," i.e., inappropriate, unascertained, unascertainable, or ambiguous data categories. Although these codes are defined as missing data categories, users may still include them in an analysis if they wish.

Code Value: Indicates the code values occurring in the data for this variable.

Value Label: Indicates the textual definitions of the codes. Abbreviations commonly used in the code definitions are "DK" (Do Not Know), "NA" (Not Ascertained), and "INAP" (Inapplicable).
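
The elements described above can be combined to decode a record from a fixed-width data file. The following Python sketch is illustrative only: the variable names (CASEID, SEX, Q1), their column locations, codes, and labels are hypothetical and are not taken from ICPSR 9721 or any other study. It also assumes that column locations are 1-based, that all codes are numeric, and that "MD=8 OR GE 9" means "the value 8, or any value greater than or equal to 9."

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Variable:
    """One codebook entry (hypothetical structure for illustration)."""
    name: str
    start: int                              # 1-based starting column
    width: int                              # number of columns occupied
    missing_codes: frozenset = frozenset()  # e.g. MD=9 -> {9}
    missing_ge: Optional[int] = None        # e.g. "GE 9" -> 9
    labels: dict = field(default_factory=dict)  # code value -> value label


def raw_value(record: str, var: Variable) -> str:
    """Slice one variable out of a fixed-width record."""
    begin = var.start - 1  # convert the 1-based column to a 0-based index
    return record[begin:begin + var.width]


def is_missing(code: int, var: Variable) -> bool:
    """Apply the variable's missing data rule to a numeric code."""
    if code in var.missing_codes:
        return True
    return var.missing_ge is not None and code >= var.missing_ge


def decode(record: str, variables: list) -> dict:
    """Return code, value label, and missing flag for each variable."""
    decoded = {}
    for var in variables:
        code = int(raw_value(record, var))
        decoded[var.name] = {
            "code": code,
            "label": var.labels.get(code),  # None if the code has no label
            "missing": is_missing(code, var),
        }
    return decoded


# Hypothetical layout: CASEID in columns 1-4, SEX in column 5, Q1 in columns 6-7.
VARIABLES = [
    Variable("CASEID", start=1, width=4),  # NO MISSING DATA CODES
    Variable("SEX", start=5, width=1, missing_codes=frozenset({9}),
             labels={1: "MALE", 2: "FEMALE", 9: "NA"}),  # MD=9
    Variable("Q1", start=6, width=2, missing_codes=frozenset({8}), missing_ge=9,
             labels={1: "YES", 2: "NO", 8: "DK", 99: "INAP"}),  # MD=8 OR GE 9
]

record = "0042299"
decoded = decode(record, VARIABLES)
```

The sketch flags missing values rather than dropping them, so the user can still decide whether to include those categories in an analysis.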