From the Statistical Research and Applications Branch (Huang, Stinchcomb, Pickle), and the Applied Research Program (Berrigan), Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, Maryland; and the College of Urban and Public Affairs (Dill), Portland State University, Portland, Oregon

Abstract

Background

There is an intense interest in the possibility that neighborhood characteristics influence active transportation such as walking or biking. The purpose of this paper is to illustrate how a spatial cluster identification method can evaluate the geographic variation of active transportation and identify neighborhoods with unusually high/low levels of active transportation.

Methods

Self-reported walking/biking prevalence, demographic characteristics, street connectivity variables, and neighborhood socioeconomic data were collected from respondents to the 2001 California Health Interview Survey (CHIS; N=10,688) in Los Angeles County (LAC) and San Diego County (SDC). Spatial scan statistics were used to identify clusters of high or low prevalence (with and without age-adjustment) and the quantity of time spent walking and biking. The data, a subset from the 2001 CHIS, were analyzed in 2007–2008.

Results

Geographic clusters of significantly high or low prevalence of walking and biking were detected in LAC and SDC. Structural variables such as street connectivity and shorter block lengths are consistently associated with higher levels of active transportation, but associations between active transportation and socioeconomic variables at the individual and neighborhood levels are mixed. Only one cluster with less time spent walking and biking among walkers/bikers was detected in LAC, and this was of borderline significance. Age-adjustment affects the clustering pattern of walking/biking prevalence in LAC, but not in SDC.

Conclusions

The use of spatial scan statistics to identify significant clustering of health behaviors such as active transportation adds to the more traditional regression analysis that examines associations between behavior and environmental factors by identifying specific geographic areas with unusual levels of the behavior independent of predefined administrative units.

Introduction

There is a recent and intense interest in the potential impact of neighborhood environments on energy balance–related health behaviors.1–4 This interest is due in part to the current epidemic of obesity and in part to continued interest in building healthy communities. One focus of this research has been on the relationship between the built environment and physical activity.1,5,6 A number of studies have demonstrated associations between diverse measures of urban form and active transportation or leisure activities such as walking and biking. However, the amount of variation in active transportation in these studies has been relatively small. In addition, the results are inconsistent, and the conclusions are based almost entirely on cross-sectional studies.6–8 Therefore, there is a significant need for new approaches to identifying the contextual and environmental correlates of active transportation in order to determine if this is a viable avenue for public health interventions and policy development.

Recent work on the environmental correlates of active transportation has largely involved regression analyses of the associations between active transportation and diverse demographic and contextual variables. Such approaches have provided insight into the relative contributions of density, diversity, design, and demography to levels of active transportation.9 Nonetheless, such approaches may be limited by the ability of simple models to describe the effects of multiple contextual and environmental influences, particularly if these influences vary geographically. Identification of spatial clusters of active transportation may provide a tool to determine whether the environmental and demographic variables identified to date can account for variation in active transportation. It may also serve a policy function by characterizing regions lacking environmental characteristics that promote active transportation.

Visual inspection of maps has long been a tool in epidemiology.10,11 More recently, over 100 statistical tests12 have been proposed to identify significant overall clustering or specific clusters of disease cases or other variables of interest in spatial data sets. The aim of the current study is to identify localized clusters of active transportation. SaTScan, a likelihood-based spatial scan statistic,13 is good at detecting these clusters,14–18 and it has been used in past studies to search for clusters of breast cancer deaths,19 activity in birds,20 symptoms reported by patients at hospital emergency rooms,21 pharmacy sales, and physician visits.22 Notably, SaTScan has also been extended to identify elliptical as well as circular clusters of disease cases,23 grade and stage of cancer diagnosis,24 and survival rates.25

In the current article, the application of SaTScan is extended to identify clusters of behavioral data in a large survey where geographic identifiers are available. Specifically, spatial scan statistics are applied to search for clusters of neighborhoods with a high or low prevalence of walking or biking and for clusters associated with a higher or lower duration of walking and biking among walkers/bikers in a large sample of survey respondents in two diverse counties in California. To our knowledge, the current study is the first reported use of spatial scan techniques to search for clusters of active transportation.

Methods

A subset of data from the 2001 California Health Interview Survey (CHIS) was analyzed in 2007–2008. This survey is a large (N=55,428 households) random digit–dial telephone survey in California, administered in seven languages (English, Spanish, Mandarin, Cantonese, Vietnamese, Korean, and Khmer). The CHIS 2001 response rate, based on the American Association for Public Opinion Research equation RR4,26 was 43.3% with a cooperation rate of 63.7% (weighted to account for the sample design; 77.1% unweighted). More than 70% of survey respondents supplied the name of the nearest intersection to their residence (in Los Angeles County [LAC], 8728/12,196=71.5%; in San Diego County [SDC], 1952/2672=73%). These intersections were geocoded to represent the location of each respondent for purposes of this analysis.

After exclusion of respondents with missing or invalid data, 8506 respondents from LAC and 1883 respondents from SDC were used in the analysis (Figure 1A and 1D). Because LAC and SDC are not adjacent, the analysis was implemented separately in the two locations. The data from these two counties are the only data from CHIS 2001 with geocodes available. Data were analyzed from both counties in order to demonstrate the performance of the proposed statistical methods on larger (LAC) and smaller (SDC) samples from cities with distinct geographic, cultural, and economic characteristics. A combination of CHIS variables, census data, and street connectivity data generated from a GIS were used to create a working data set for the spatial cluster analysis and for subsequent characterization of active transportation clusters.

(A) Nearest intersection to respondents’ residences in Los Angeles County; (B) Clusters of high/low walking/biking prevalence in Los Angeles County, without age-adjustment; (C) Cluster of short walking/biking duration among walkers/bikers in Los...

California Health Interview Survey Variables

The CHIS 2001 data included in this study were measures of nonleisure-time walking and biking (NLTWB) and multiple relevant demographic variables. NLTWB was measured by asking three short questions: (1) Over the past 30 days, have you walked or bicycled to or from work, school, or to do errands? (2) How many times per day, per week, or per month did you do this? (3) And on average, about how many minutes did you walk or ride your bike each time? NLTWB was analyzed either as a measure of prevalence (yes/no: any walking/biking or none), from the answers to the first question, or as a measure of duration (minutes per week spent walking/biking), derived from the answers to the second and third questions.
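The derivation of the two NLTWB measures can be illustrated with a minimal sketch. The conversion factors (a 7-day week and a 30-day month, chosen to match the 30-day recall window of the first question) and the function names are assumptions made for illustration; the exact conversion used with the CHIS data is not stated here.

```python
# Assumed conversion factors (not given in the text): 7-day week, 30-day month.
TIMES_PER_WEEK = {"day": 7.0, "week": 1.0, "month": 7.0 / 30.0}


def is_walker_biker(answer_q1):
    """Prevalence indicator from question 1 (any walking/biking: yes or no)."""
    return answer_q1.strip().lower() == "yes"


def minutes_per_week(times, unit, minutes_each):
    """Weekly duration from question 2 (frequency) and question 3 (minutes per trip)."""
    if unit not in TIMES_PER_WEEK:
        raise ValueError(f"unknown frequency unit: {unit!r}")
    return times * TIMES_PER_WEEK[unit] * minutes_each
```

Under these assumptions, a respondent who reports two trips per day of 10 minutes each is assigned 140 minutes per week.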

Demographic and SES variables including age, gender, race, education, and income were also extracted from the CHIS survey resource for each respondent, as were self-reported health status, immigration status, and employment status. For self-reported health status, an activity-related variable was chosen based on responses to the query: How much does your health limit you when climbing several flights of stairs? Responses were on a three-part scale: limited a lot, limited a little, not limited at all. CHIS includes other variables related to diet, tobacco and alcohol use, cancer-screening practices, and healthcare coverage; in the current study, the focus was on variables commonly used in past studies of active transportation.

Contextual Variables

The street connectivity and two density-related variables were analyzed using circular buffers (areas around a point) of radius 0.5 km surrounding each respondent’s location (nearest intersection to home). These buffers were defined using the Topologically Integrated Geographic Encoding and Referencing System (TIGER) map files from the 2000 U.S. Census Bureau and implemented with GIS software (ArcView, ESRI, Inc.). Data concerning population and employment density and characteristics of the street network for each buffer were then calculated using information from the U.S. Census at the census tract or census-block (administrative units that are nested within census tracts) level.

Population density within a buffer was generated by downloading U.S. Census data at the census-block level. Each half-kilometer buffer usually overlapped more than one census block. Population density was assumed to be uniform within each census block, and a portion of the population within a census block was assigned to the buffer based on the area of the census block within the buffer. For example, if a buffer covers half of a census block, half of the census block’s population is assigned to that buffer, in addition to the population in census blocks that were completely within the buffer. The total population in the buffer was then divided by the buffer area (0.785 square kilometers). Employment density data were generated using data from the metropolitan planning organization for each area: the Southern California Association of Governments (SCAG) for Los Angeles and the San Diego Association of Governments (SANDAG) for San Diego. Each agency provided total employment data by census tract for the year 2000. The method to calculate employment density was identical to that for calculating population density, except that tracts were used instead of blocks. Because tracts are larger than blocks, the uniform-density assumption is coarser for employment, and the variances associated with population and employment densities are therefore likely to differ in this study.
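The area-weighted apportionment described above can be sketched as follows. The function name and the input layout (a count paired with the fraction of the block or tract area lying inside the buffer) are hypothetical; in practice the overlap fractions would come from a GIS overlay, which is not reproduced here.

```python
import math

BUFFER_RADIUS_KM = 0.5
BUFFER_AREA_KM2 = math.pi * BUFFER_RADIUS_KM ** 2  # about 0.785 square kilometers


def buffer_density(units):
    """Density per square kilometer within one buffer.

    units: iterable of (count, fraction) pairs, where count is the population
    (or employment) of a census block (or tract) and fraction is the share of
    that unit's area inside the buffer. Counts are assumed to be spread
    uniformly within each unit.
    """
    total = 0.0
    for count, fraction in units:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fraction must be between 0 and 1")
        total += count * fraction
    return total / BUFFER_AREA_KM2
```

For example, a buffer that fully contains a block of 100 residents and covers half of a block of 200 residents is assigned 200 residents, or roughly 255 residents per square kilometer.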

Four measures of street connectivity were also extracted: intersection density, street network density, census-block density, and average street block (segments of a street between intersections) length. The street network density is calculated by summing the lengths of all the streets within the buffer (the total street network distance within the buffer, ignoring the number of lanes on a road) and dividing by the area of the buffer (0.785 square kilometers). The portion of a street that continued outside the buffer was not included. Census-block density is the total number of census blocks within a buffer divided by the area of the buffer (0.785 square kilometers). Census-block boundaries generally coincide with streets. If a portion of a census block was outside a buffer, only the area of the census block within the buffer was included. The average street block length is the average length of the street blocks that are completely or partially within the buffer. For street blocks that continue outside the buffer, the entire length of the street block is included in the calculation. Truncating the street block at the buffer boundary would have reduced the length of the block artificially. Six buffers were randomly chosen, including two that overlapped, and connectivity estimates were validated by plotting the buffers with the links, nodes, and blocks and then manually extracting the data.
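The different treatment of street segments in the two length-based measures (clipped at the buffer boundary for network density, unclipped for average block length) can be made concrete with a short sketch. The function names and the input layout, a pair of full length and length inside the buffer for each street block, are assumptions for illustration only.

```python
import math

BUFFER_AREA_KM2 = math.pi * 0.5 ** 2  # 0.5 km radius buffer, about 0.785 km^2


def street_network_density(segments):
    """Kilometers of street per square kilometer of buffer.

    segments: list of (full_length_km, length_inside_buffer_km) per street block.
    Only the portion inside the buffer is counted.
    """
    return sum(inside for _full, inside in segments) / BUFFER_AREA_KM2


def average_block_length(segments):
    """Mean full length (km) of street blocks at least partly inside the buffer.

    Blocks that continue outside the buffer contribute their entire length,
    so that truncation does not shorten them artificially.
    """
    full_lengths = [full for full, inside in segments if inside > 0]
    if not full_lengths:
        raise ValueError("no street blocks intersect the buffer")
    return sum(full_lengths) / len(full_lengths)
```

A block of 0.4 km with only 0.1 km inside the buffer therefore adds 0.1 km to the network density numerator but 0.4 km to the block-length average.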

Spatial Analysis

Spatial clustering of the prevalence of walking or biking in LAC and SDC was separately analyzed using circular spatial scan statistics based on a Bernoulli model13 with walkers/bikers as cases and others as controls, with and without age-adjustment.16 Because SaTScan software allows adjusting for categoric variables with up to only 12 categories with a Bernoulli model, age was categorized into six levels (18–29, 30–39, 40–49, 50–59, 60–69, and ≥70 years), which were then used for age-adjustment in the Bernoulli model–based analysis scan. The length of time spent in active transportation by walkers/bikers was also explored using scan statistics based on a normal model (www.satscan.org) for continuous data (no covariate adjustment allowed with this model). The individual characteristics, buffer characteristics, and the street connectivity status within significant clusters were then examined by comparing the factors within the clusters to those outside the clusters using a t-test for continuous variables and a chi-square test for categoric variables.
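The inside/outside comparison for a categoric variable can be sketched for the simplest case, a 2x2 table. This is a plain Pearson chi-square without continuity correction; whether a correction was applied in the analysis is not stated, so that choice is an assumption, as is the function name.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 df, no continuity correction).

    Table layout: rows are inside / outside the cluster, columns are
    characteristic present / absent, i.e. [[a, b], [c, d]].
    Returns (statistic, p_value).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("a row or column total is zero")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

Variables with more than two categories would need the general r x c form of the test, and continuous variables the t-test mentioned above; neither is shown here.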

A circular scan statistic searches one or more circles with a series of hierarchically overlapping areas of increasing size around each centroid (in this case the nearest intersection to the respondent’s home address). Detection of a significant cluster of high/low prevalence indicates that the people inside the cluster have a significantly higher/lower likelihood of walking/biking for transportation compared with the respondents in other areas. The maximum search window in the scan approach is user-selected. Smaller maximum search windows are useful to identify local clusters in more homogeneous neighborhoods. To select the appropriate maximum search window, the initial maximum search window was set to include 50% of the total number of respondents, and then successively smaller search windows were examined. Large clusters have a better chance of being declared significant (p<0.05) compared with small clusters because of their larger sample size. However, a large cluster may cover a large geographic area that includes local neighborhoods with diverse characteristics of individuals, buffers, and street connectivity. Ultimately, a maximum cluster size was selected that contained 3% of the population, because a smaller cluster size would contain a more homogeneous population, and almost all detected clusters with smaller sizes are within clusters found with a larger maximum size. Selection of maximum search windows generally requires consideration of study aims and some exploratory analysis such as the above.
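The search procedure described above can be sketched in a few lines. This is a simplified illustration of a circular Bernoulli scan, not SaTScan's implementation: it scans for high and low prevalence together, grows each window one respondent at a time (so coincident locations are not grouped), reports only the most likely cluster, and obtains a Monte Carlo p-value by permuting case labels. All function and parameter names are hypothetical.

```python
import math
import random


def _xlogy(x, y):
    """x * log(y), with the convention that 0 * log(0) is 0."""
    return 0.0 if x == 0 else x * math.log(y)


def bernoulli_llr(c, n, C, N):
    """Log-likelihood ratio for a window holding c cases among n respondents,
    given C cases among N respondents in the whole study area."""
    if n == 0 or n == N:
        return 0.0
    p_in, p_out, p_all = c / n, (C - c) / (N - n), C / N
    log_alt = (_xlogy(c, p_in) + _xlogy(n - c, 1 - p_in)
               + _xlogy(C - c, p_out) + _xlogy((N - n) - (C - c), 1 - p_out))
    log_null = _xlogy(C, p_all) + _xlogy(N - C, 1 - p_all)
    return log_alt - log_null


def scan(points, cases, max_frac=0.03):
    """Return (llr, center_index, member_indices) for the most likely cluster.

    points: list of (x, y) locations; cases: list of 0/1 (walker/biker or not);
    max_frac: maximum window size as a fraction of all respondents.
    """
    N, C = len(points), sum(cases)
    max_n = max(1, int(max_frac * N))
    best = (0.0, None, [])
    for i, (xi, yi) in enumerate(points):
        # Respondents ordered by distance from the current centroid.
        order = sorted(range(N), key=lambda j: (points[j][0] - xi) ** 2
                       + (points[j][1] - yi) ** 2)
        c = 0
        for n, j in enumerate(order[:max_n], start=1):
            c += cases[j]
            llr = bernoulli_llr(c, n, C, N)
            if llr > best[0]:
                best = (llr, i, order[:n])
    return best


def monte_carlo_p(points, cases, max_frac=0.03, n_sim=99, seed=0):
    """p-value of the most likely cluster from random relabeling of cases."""
    rng = random.Random(seed)
    observed = scan(points, cases, max_frac)[0]
    shuffled = list(cases)
    exceed = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        if scan(points, shuffled, max_frac)[0] >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)
```

Rerunning `scan` with successively smaller `max_frac` values (for example from 0.5 down to 0.03) mirrors the exploratory choice of the maximum search window; this brute-force version is quadratic in the number of respondents and is meant only to show the logic.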

Clustering of higher/lower prevalence in active transportation does not imply longer/shorter walking/biking time. Therefore, clustering of the length of time spent walking/biking among the walkers/bikers was also explored using the spatial scan method based on the normal model (www.satscan.org). The total sample size is smaller in this subanalysis because not all respondents reported any walking/biking for transportation (3573 in LAC, 680 in SDC). A 3% maximum search window size is too small for this case, especially for SDC analysis. Therefore, a minimum window size of two individuals and a maximum size of 10% of total walkers/bikers were used for the analysis of active transportation duration.
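For the duration analysis, the window statistic changes from a comparison of proportions to a comparison of means. The sketch below gives the log-likelihood ratio that follows from a normal likelihood with different means inside and outside the window and a common variance; it is a textbook derivation under that assumption, and details of the SaTScan normal model may differ. The function name is hypothetical.

```python
import math


def normal_llr(inside, outside):
    """Log-likelihood ratio for a window under a common-variance normal model.

    inside / outside: weekly walking/biking minutes of walkers/bikers within
    and beyond the candidate window. The null hypothesis is a single mean.
    """
    inside, outside = list(inside), list(outside)
    if not inside or not outside:
        return 0.0
    values = inside + outside
    N = len(values)
    mean_all = sum(values) / N
    var_null = sum((v - mean_all) ** 2 for v in values) / N
    if var_null == 0:
        return 0.0  # all values identical: no evidence of a cluster
    mean_in = sum(inside) / len(inside)
    mean_out = sum(outside) / len(outside)
    var_alt = (sum((v - mean_in) ** 2 for v in inside)
               + sum((v - mean_out) ** 2 for v in outside)) / N
    if var_alt == 0:
        return math.inf  # perfect separation of the two groups
    # With maximum likelihood variances, the ratio reduces to (N/2) * ln(var_null / var_alt).
    return 0.5 * N * math.log(var_null / var_alt)
```

The window search and Monte Carlo testing proceed as for the prevalence analysis, with the minimum window of two individuals and the 10% maximum described above.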

Results

Respondents from LAC and SDC have characteristics similar to those of the entire CHIS sample in California, except that the LAC/SDC sample is more racially/ethnically diverse (Table 1). Compared to a U.S. nationwide survey, the combined study area and California as a whole are more racially/ethnically diverse, younger, have a lower average income, and have more immigrants, more college graduates, and more residents who did not graduate from high school. LAC has fewer non-Hispanic whites, more immigrants, and lower income and education levels than SDC. These comparisons are not relevant to the implementation of the spatial scan approach, but they may influence interpretation of the findings of such analyses.

Overall, the percentage of walking/biking and the average walking/biking time over all respondents in LAC is higher than it is in SDC (42.0% vs 36.1% and 84 vs 80 minutes per week, respectively). Without age-adjustment, there are more significant clusters detected in LAC than in SDC, perhaps because of greater spatial variation of active transportation or larger sample size in LAC (Figure 1B, 1C, and 1E). Five clusters with more walkers/bikers and two clusters with fewer walkers/bikers are detected in LAC using 3% of the number of total respondents as the maximum search window (Figure 1B). Cluster 5 is significant when looking for clusters of only more walkers/bikers; it is not quite significant (p=0.066) when testing for both more and fewer walkers/bikers. All other clusters are very significant (p<0.01). The relative risks of being a walker/biker were 1.7, 1.5, 1.6, 1.5, 1.5, 0.4, and 0.6 for Clusters 1 to 7, respectively. Three significant clusters of more walkers/bikers (p-values<0.05) and a marginally significant one with fewer walkers/bikers (p-value=0.06) were detected in SDC (Figure 1E) when searching for clusters of either more or fewer walkers/bikers. The relative risks of being a walker/biker in SDC were 2.2, 2.8, 2.0, and 0.09 for Clusters 1 to 4, respectively. Among walkers/bikers, average walking/biking times per week were 188 minutes (n=3573, SD=319) in LAC and 173 minutes (n=680, SD=232) in SDC. One area with lower walking/biking duration was detected in LAC (Figure 1C), but no significant clusters were detected in SDC, suggesting that the duration of activities among those walkers/bikers is spatially homogeneous or that there is a lack of power because of modest sample sizes, particularly in SDC.
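The relative risks quoted above can be read as a ratio of prevalences. The sketch below assumes the relative risk of a cluster is the proportion of walkers/bikers inside the cluster divided by the proportion outside it; the formula is not spelled out in the text, and the counts in the usage note are hypothetical rather than taken from the study.

```python
import math


def relative_risk(c, n, C, N):
    """Prevalence inside a cluster divided by prevalence outside it.

    c: walkers/bikers inside the cluster; n: respondents inside the cluster;
    C: walkers/bikers overall; N: respondents overall.
    """
    if n <= 0 or n >= N:
        raise ValueError("cluster must contain some but not all respondents")
    outside_cases = C - c
    if outside_cases == 0:
        return math.inf
    return (c / n) / (outside_cases / (N - n))
```

With hypothetical counts of 60 walkers/bikers among 100 respondents in a cluster and 400 among 1,000 overall, the ratio is about 1.6; values below 1, such as those reported for the low-prevalence clusters, indicate less walking/biking inside the cluster than outside.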

To illustrate the capacity of the spatial scan approach to explore contributions of specific variables to clustering, the clustering pattern after adjusting for age was also evaluated. Because younger people tend to have more active transportation than older people, spatial heterogeneity in age distributions alone could result in spatial clustering of active transportation. In LAC, age-adjustment accounts for some but not all of the significant clusters of active transportation (Figure 1F vs 1B). The clusters with high prevalence in SDC were similar with and without age-adjustment (maps after age-adjustment not shown). However, the marginal cluster of low prevalence (No. 4 in Figure 1E) is no longer found after age-adjustment.

Characteristics of Clusters

The patterns of the other covariates in the analysis without age-adjustment (Table 2 and Table 3) were similar to those with age-adjustment (see Appendixes A and B, available online at www.ajpm-online.net). This discussion focuses on cluster properties for clusters identified without age-adjustment, allowing for description of age characteristics inside/outside clusters. Neighborhood characteristics were consistently different between high- and low-walking clusters in both counties, whereas comparisons of individual characteristics varied geographically (Table 2 and Table 3). Population and employment densities were higher in high-walking clusters and lower in low-walking clusters. Similarly, higher street, block, and intersection densities; shorter block lengths; and the presence of a bus route were associated with more walking and biking for transportation, compared to the rest of each county. Most of these street connectivity differences were highly significant.

Individual and contextual features associated with clusters of high and low walking prevalence in San Diego County, % unless otherwise indicated

People were slightly younger in the high-walking areas and older in the low-walking areas, but gender and BMI did not seem to be related to the pattern of active transportation. The associations between walking/biking and other individual characteristics generally varied among the clusters. In LAC, high-walking Clusters 1, 2, 4, and 5 had lower income and education levels, whereas Cluster 3 residents had higher education and income levels. Clusters 1, 2, and 5 had predominantly Hispanic residents, whereas Clusters 4 and 5 had more black residents; Cluster 3 is predominantly non-Hispanic white compared to areas outside the clusters (Table 2). In SDC, high-walking Clusters 1 and 3 had lower income and education levels; the small Cluster 2 had more highly educated residents but not a significantly different income distribution compared with the rest of the county. Cluster 1 had more Hispanic and black residents; Cluster 3 had more Hispanic residents; and Cluster 2 had more non-Hispanic whites than areas outside the clusters. In contrast, the low-walking clusters in both counties have more non-Hispanic white residents and higher income levels.

Self-reported health status showed mixed associations with walking/biking prevalence. For example, LAC high-walking Clusters 1, 2, 3, and 4 had fewer residents who responded that they were limited a lot in climbing stairs compared to those outside the clusters, but more residents had such limitations in Cluster 5. In Cluster 4, respondents reported the highest health status among all cluster areas and outside cluster areas (76.5% reported no limitation in climbing stairs). Street connectivity is very high in Cluster 4, so despite significantly fewer bus stops, freeways, and bus routes, it still appeared to be a cluster of more walkers/bikers, and their average duration walked/biked is the longest (264 minutes per week).

No significant clusters of longer or shorter walking/biking times were found in either county, independent of the prevalence of active transportation, but one area in LAC had a shorter walking/biking duration that was of borderline significance (Figure 1C). This area had residents with more education (86% vs 59% college education), higher income (75% vs 44% with income more than 300% of poverty level), fewer immigrants (23% vs 29%), and more non-Hispanic whites (75% vs 39%).

Although overall the length of time spent in active transportation did not cluster significantly in the study counties, the average walking/biking time did vary among the high- and low-prevalence clusters. The average time spent walking/biking among walkers/bikers in LAC Cluster 3 is lower than the average time spent outside the clusters (181 vs 192 minutes), whereas the time spent is longer inside than outside the other clusters (239, 193, 264, and 269 minutes for Clusters 1, 2, 4, and 5, respectively). Even though more people were found to walk/bike in SDC Clusters 2 and 3, their time spent is not long compared with those people who do walk/bike outside the clusters (118 and 158 vs 170 minutes per week). However, SDC high-walking Cluster 1 had a longer time spent walking/biking than the area outside clusters (263 minutes in Cluster 1 vs 170 minutes outside SDC clusters).

Discussion

The identification of disease clusters has been a prominent tool for health researchers at least since Snow’s analysis of cholera.11 Here, this approach is extended to the analysis of active transportation, a potential contributor to public health via the beneficial health effects of physical activity. Studies using traditional regression models have suggested that multiple factors may influence walking/biking. For example, transportation walking appears to be influenced by age, gender, auto ownership, and SES,9,27 as well as diverse environmental variables.28,29 In the current paper, it is demonstrated that spatial scan techniques can be used to find neighborhood scale clusters of frequency and duration of walking and biking in two California counties. This demonstration could help to identify unexplored correlates of walking through more detailed analysis of identified neighborhoods and to guide allocation of resources to specific neighborhoods.

Note that cluster detection and regression models are part of a logical sequence of analyses aimed at understanding the determinants of geographic variation in behavior or disease. The spatial scan statistics search for local clusters with high/low active transportation. The comparison here of the factors in/outside a particular cluster is the first step in an attempt to identify factors associated with active transportation. A traditional regression model is the next step in assessing the significance of these and other factors simultaneously. Although this type of model is preferable to the simple univariate comparison tests reported here, some effects in such a model may turn out to be nonsignificant if they have an impact on active transportation selectively in local areas rather than at a countywide level.

Application of the spatial scan method to CHIS data in LAC and SDC identified a number of significant clusters of high and low prevalence of walking or biking for transportation. The current examination of neighborhood characteristics supports previous studies27 suggesting that there are multiple factors that have an impact on physical activity at a very local level. There was no clear profile of a “high walking” place. High-prevalence clusters were consistently found to have significantly higher population and employment density and street connectivity, but the associations between more frequent walking/biking and race/ethnicity, population density, income, and even age varied across the significant clusters. The clusters of high walking prevalence identified in LAC include two communities on the ocean (Long Beach and Santa Monica) and three communities in the downtown and central regions of the city. This finding also supports the argument that walking/biking for transportation is influenced by a range of different demographic, economic, and environmental variables.

Many past studies of active transportation and the environment rely on data aggregated to administrative areas, such as counties, census tracts, or ZIP codes.3,6,9 A notable feature of the spatial scan approach is that it uses information about individuals to identify spatial variation in a disease or behavior rather than identifying administrative units differing in some population characteristic. Nevertheless, aggregate data from administrative units containing respondents will often be required for multilevel analyses, and selection of the appropriate unit for such data is critically important, as the degree of clustering and conclusions drawn can differ by geographic scale.30

Conventional power analyses are not possible for spatial scan approaches. Many analytical and simulation studies indicate that the power of detecting clusters varies by geographic features of the study region, the definition of the geographic unit, the strength of the clustering pattern as measured by disparity in relative risks inside/outside clusters, the geographic size and the population of the clusters, and the sample size (total number of events in the whole study region).14–18 There is no standard way to obtain a condition (such as a sample size) that achieves a pre-specified power level for this kind of analysis. However, it has been found18 that SaTScan has power greater than 90% to detect clusters with a relative risk of 1.5 when the sample size was 2500 (smaller sizes were not tested). In the current study, the sample size was much smaller in SDC (680) than in LAC (about 3600), which may be one of the reasons that more clusters were detected in LAC compared to SDC and that no cluster was found to be related to active transportation duration in SDC.

Unlike other cluster-detection techniques (e.g., Getis-Ord Gi*,31 AMOEBA,32 ULS33), SaTScan can adjust for covariates. To illustrate this function, an adjustment was made for age in the analysis of spatial clusters of active transportation in LAC and SDC. If a covariate of interest is not significantly different inside and outside a cluster, then adjustment will not influence the detection of the clusters. However, adjustment for covariates that do differ can alter the significance of a cluster or eliminate it, as it did for the LAC study in the current analysis, depending on the correlations among trait values of respondents constituting the cluster.

There are several limitations to this study. First, estimates of frequency and duration of walking and biking for transportation are based on self-reports. Such reports are likely to overestimate the amount of walking.34 Second, data are available for locations of current residences without regard to how long the respondents had lived there or where they lived previously, which may have influenced their inclination to walk or bike. Data on important characteristics of the built environment that may influence residents to walk or bike for transportation, such as neighborhood land use mix,27,35 are also lacking. Third, this approach identifies circular clusters, but real “neighborhoods” or walking domains are likely to have more complex shapes.23,33,36 Fourth, different buffer sizes or shapes could give different neighborhood characteristics, which may lead to a slightly different summary of some of the variables in Table 2 and Table 3. Caution is required in the interpretation of the neighborhood-level effects. It is well recognized that associations based on aggregated characteristics may not hold for individuals, but because the aggregation is over a small buffer area and is meant to describe conditions in the neighborhood rather than for the individuals who live there, this is likely only a minor problem. Fifth, the selection of the maximum window size in this study is subjective. Therefore, this kind of analysis provides only clues to the local relationship between active transportation and selected covariates. Other study designs are required to confirm these associations.

Despite these limitations, the spatial scan method is a promising approach to learning more about geographic variation in the prevalence of health behaviors. The method requires geographic identifiers at the individual level. Choice of a maximum window size for the cluster scan will be a compromise between including a population large enough to provide sufficient statistical power to detect significant clusters and small enough to identify local neighborhoods of interest consistent with the goals of the study. The spatial scan approach has the potential to be part of a toolkit for researchers to find new correlates of active transportation and other health behaviors and a method for policymakers to identify areas to serve as positive examples of walker/biker-friendly neighborhoods or neighborhoods that might provide opportunities for intervention.

Acknowledgments

We thank Bill Davis for advice on weighting of the CHIS data set, Robert Adamski for data extraction efforts, and several anonymous reviewers for helpful comments. We also acknowledge the CHIS team and CHIS respondents for creating a fascinating data resource. No financial disclosures were reported by the authors of this paper.

Footnotes

26. American Association for Public Opinion Research. Standard definitions: final dispositions of case codes and outcome rates for surveys. 3rd Ed. Lenexa KS: American Association for Public Opinion Research; 2004.