Statistical Analysis of Geographic Information with ArcView GIS And ArcGIS

By David W. S. Wong

John Wiley & Sons

ISBN: 0-471-46899-1

Chapter One

INTRODUCTION

1.1 WHY STATISTICS AND SAMPLING?

Attempts to understand, explain, estimate, or predict events or phenomena occurring around us often start by simplifying the information we have about them. In many cases, statistics have been devised and used to digest large quantities of information and to provide streamlined, concise impressions of the events or phenomena we are trying to comprehend. For example, the population counts of the 164 cities in Ohio would mean little to us unless we knew the largest, smallest, or average size among these cities, or the range within which these city population sizes vary. In this case, the maximum, minimum, average, and range of population counts are among the summary information known as statistics, because they help to describe how values in a set of numeric information, or data, are distributed.

With this understanding, we can state that, given a set of numeric data, statistics are quantitative measures derived from the data to describe various aspects of the data. Classified by their functions, statistics fall into two groups: descriptive statistics and inferential statistics. Descriptive statistics are calculated from a set of data to describe how the values are distributed within it. For example, the maximum, minimum, range, and average of a set of data are all in this category. Inferential statistics are calculated from sample data for the purpose of making an inference to a population or for making comparisons between sets of data. Classical statistics, or conventional statistics, are used across many application areas, such as sociology, political science, medicine, and engineering, but they have often been modified and extended to accommodate the needs of specific fields. In this book, we will discuss a great deal of what are known as spatial statistics. These statistics are strongly based upon classical statistics but have been extended to work with data that are spatially referenced. Other extensions of classical statistics for various application areas include econometrics, psychometrics, biostatistics, geostatistics, and several others. Certain statistics discussed in this book are sometimes classified as geostatistics, which originated in the geosciences.
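The descriptive measures named above can be illustrated with a short Python sketch. Python is used here only for illustration (the book itself works with ArcView GIS and ArcGIS), and the city population figures below are made-up placeholders, not census counts:

```python
# Descriptive statistics for a small set of hypothetical Ohio city populations.
# The figures are illustrative placeholders, not actual census values.
populations = [12500, 48200, 7300, 141000, 25600, 9800]

maximum = max(populations)                      # largest city
minimum = min(populations)                      # smallest city
average = sum(populations) / len(populations)   # mean city size
value_range = maximum - minimum                 # spread of the values

print(maximum, minimum, round(average, 1), value_range)
```

Each of these four numbers summarizes one aspect of how the values in the dataset are distributed, which is exactly what makes them statistics.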

With statistics, an analysis can be performed to understand how data values concentrate or disperse around certain values, how they compare with each other or with another set of data, or whether they are just subsets of a larger set of data. When analyzing data statistically, each observation should be independent, so that its values are not dependent on, or tied to, the values of other observations in the same data set. This independence assumption is one of the most fundamental assumptions in statistical analysis. Unfortunately, it is often violated for data collected to describe events or phenomena that are spatially referenced. This is because, in many geographic events or phenomena, what happens at a location is highly correlated with what happens in its surroundings. Because of this characteristic of spatially referenced data, much of our discussion in this book will focus on how statistics and associated methods can be modified to analyze spatially referenced data.

When one attempts to answer a scientific question, one will rarely draw conclusions based on just one or a few observations. For instance, if there were one or a few cases of malaria in a community, could we say that there is a real epidemic, or should we treat those occurrences as accidents or events occurring by chance? To take another example, can we conclude that the soil in a farm has lost its fertility if the farmer harvested a much smaller crop this year than last? Could the decline in yield be a one-time event or a short-term fluctuation? Will it happen again next year? Is soil fertility the only factor determining the amount of crop yield? Before any conclusions can be drawn, we need to understand the nature of these events or occurrences. In other words, when a certain phenomenon occurs, it may be due to a random process or a systematic process, and we have to determine which. If an event or a phenomenon is triggered by a random process, there may not be much we can do to identify its underlying cause in order to explain why it happens the way it does. But if it is part of a systematic process, the numeric or spatial patterns will be interesting to study and explore. As the first step in understanding these processes, statistical analysis is usually the tool used to help us decide whether the events are random or not.

Using the soil example again, if we suspect that the soil fertility of a farm is low, and if this suspicion is based on some observations made in this farm, we are essentially formulating a hypothesis. This hypothesis can be tested to see if it should be rejected. To test this hypothesis, we would need to gather more information or data about the soil. Instead of just focusing on a small plot in the farm, we may want to examine different plots around the field for soil fertility levels. For a more rigorous study, we may want to drill holes at various locations in the field to collect soil samples to conduct a soil chemical analysis in a laboratory. By selecting different locations in the field to drill holes, we are essentially collecting a sample of soil for further examination instead of examining the entire population, which will require us to examine every location in the field. Each examined location can be regarded as an observation or a case in the sample, and the number of observations selected is known as the sample size. Similarly, by examining the same location over time and treating each examination of that location as an observation, we are collecting these observations from a population along the temporal dimension. Finally, the measured value from an observation is normally referred to as a data value. When there is a set of such values, they are referred to as a dataset.

After a sample of soil is assembled from different locations, a chemical analysis can be conducted to evaluate the levels of different chemicals, such as phosphorus, nitrogen, and potassium, in the sample. A measurement of each chemical can be derived by examining all observations in the sample, such as an average of 30 mg of nitrogen per 1 kg of soil. This measurement is then a statistic, because it is derived from all the observations in the sample. If the data-gathering process covers the entire population, the corresponding measurement is known as a parameter. For instance, in the U.S. decennial census, certain questions were, in principle, asked of all individuals in the United States (we know that some people were missed due to the difficulty of reaching them, the so-called undercount). The measurements derived from those questions are parameters.

When analyzing a sample, a logical question to ask is, why should we examine a sample but not the entire population? Isn't it more accurate to enumerate the entire population? Of course, we would prefer to survey the entire population if we could. But often it is impossible and/or impractical for one or more of the following reasons:

1. The population is too large to be enumerated completely.

2. The cost of enumerating the entire population may be prohibitive.

3. The study requires a quick turnaround time, and studying the entire population may take too long.

4. If the enumeration process requires destroying the observations, such as in certain processes of quality control, then a full enumeration will destroy the entire population.

Using the soil study example again, it is impossible to evaluate the fertility level of every cubic foot of soil in the field for a complete examination. It would also be very expensive and would take far too long to obtain the full result. In addition, if a hole were drilled at every location to gather soil, there would be no soil left in the field. Therefore, sampling is often used instead of examining the entire population, and for this reason the study of statistics becomes necessary.

The statistics on chemical levels that are generated from the soil sample may offer descriptive information about the condition of the soil in the field, including a numerical distribution of the chemical levels. Therefore, these statistics are regarded as descriptive statistics. How accurate the statistics are in describing the distribution of chemical levels in the entire field or in describing the population is dependent upon many factors. We know that these statistics will never be 100% accurate (since they are not from a complete survey of every inch of soil in the farm) and that the level of accuracy is dependent upon how representative the sample is of the population.

Fortunately, procedures have been developed, based on random processes, to allow us to draw conclusions on whether a sample is a reliable representative of the population or not. This process of drawing a conclusion about a population based on information derived from a sample is known as inference. The process of drawing an inference normally includes

1. formulating one or more hypotheses,

2. collecting relevant data by making observations,

3. computing descriptive or test statistics, and

4. deciding if the hypothesis should be rejected based on the computed statistics.
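The four steps above can be sketched as a small simulation. The nitrogen readings, the 30 mg/kg hypothesis, and the bootstrap-style test below are all illustrative assumptions introduced for this sketch, not material from the book:

```python
import random
import statistics

random.seed(42)

# Step 1: formulate a hypothesis -- the field's mean nitrogen level
# is 30 mg per 1 kg of soil (an assumed figure for illustration).
HYPOTHESIZED_MEAN = 30.0

# Step 2: collect relevant data by making observations
# (hypothetical readings from sampled locations, in mg/kg).
sample = [27.1, 31.4, 28.9, 25.6, 29.3, 26.8, 30.2, 27.7, 28.4, 26.5]

# Step 3: compute a test statistic -- how far the sample mean
# sits from the hypothesized mean.
observed_diff = abs(statistics.mean(sample) - HYPOTHESIZED_MEAN)

# Step 4: decide by simulation. Recenter the sample on the hypothesized
# mean, resample it many times, and count how often chance alone
# produces a difference at least as large as the one observed.
shift = HYPOTHESIZED_MEAN - statistics.mean(sample)
recentered = [x + shift for x in sample]
trials = 10_000
exceed = 0
for _ in range(trials):
    resample = random.choices(recentered, k=len(recentered))
    if abs(statistics.mean(resample) - HYPOTHESIZED_MEAN) >= observed_diff:
        exceed += 1
p_value = exceed / trials
print(f"simulated p-value: {p_value:.3f}")
```

A small simulated p-value suggests that the observed difference is unlikely under the hypothesis, which argues for rejecting it; a large one suggests the data are consistent with a random process.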

If sampling is desirable or preferred because an exhaustive survey of the population is not possible, then the sampling process should be carefully considered. But how should one select sample observations from the population? There are two general sampling schemes one may adopt: random sampling and systematic sampling. Random sampling is the process of selecting observations randomly from the population without any specific predefined structure or rules. Often, random numbers are used to assist the selection process. For example, items in an ordered set of objects are selected as samples if their positions correspond to those assigned by the random numbers. Alternatively, all objects in the set can be mixed up randomly before selection.

In contrast to random sampling, systematic sampling is the process of selecting observations based on certain rules developed according to certain principles. These principles are based on the objective(s) of the studies. Often one would like to adopt a sampling principle to cover the entire spectrum of the population. For instance, one may select every fifth observation from an ordered list of objects or select the households at the northwest corner of every street block in the city.
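Both general schemes can be sketched in a few lines of Python; this is only an illustration (the book's own examples use ArcView GIS and ArcGIS), with an ordered list of 164 generic objects standing in for a real population:

```python
import random

random.seed(0)

# An ordered list standing in for a population of 164 objects.
population = list(range(1, 165))

# Random sampling: positions are chosen by random numbers,
# with no predefined structure or rule.
random_sample = random.sample(population, 20)

# Systematic sampling: select every fifth observation
# from the ordered list.
systematic_sample = population[::5]

print(len(random_sample), systematic_sample[:4])
```

Note that the systematic rule fixes which observations can ever be selected once the starting point is chosen, whereas random sampling gives every subset of the population a chance of being drawn.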

But sometimes a study may want to emphasize a specific segment or segments of the population, such as minority groups in the general population. For this purpose, sampling can be set up so that a particular minority group is sampled more heavily than other groups. However, this should be done only with careful consideration of what the sample represents, because deliberately oversampling those segments can bias the results if the overrepresentation is not accounted for.

Within the two general sampling schemes, additional variations of the sampling process have been developed. For instance, observations sharing certain common characteristics may be grouped into different strata. With objects in different strata or groups, either random or systematic sampling can be performed within each stratum or group. This is called stratified sampling.

For example, selecting 20 cities from the 164 cities in Ohio may be performed in several ways. For random sampling, all 164 cities may be ordered or ranked by their population sizes, and we can then select the cities whose ordered positions match the first 20 numbers drawn from a random number table. For systematic sampling, we can select every eighth city from the ordered list until we have selected 20 cities. Finally, we can use stratified sampling by first dividing the 164 cities into four groups based on their locations in northeast, northwest, southeast, or southwest Ohio and then selecting, either randomly or systematically, 5 cities from each of the four groups to ensure that the sampled cities provide a good representation of Ohio cities over the entire state.
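The stratified scheme might be sketched as follows. The city names are hypothetical placeholders, and the even split of 41 cities per quadrant is an assumption made purely for the illustration:

```python
import random

random.seed(1)

# Hypothetical city identifiers grouped into four regional strata.
# The even 41-per-quadrant split is assumed for illustration only.
strata = {
    "northeast": [f"NE-{i}" for i in range(41)],
    "northwest": [f"NW-{i}" for i in range(41)],
    "southeast": [f"SE-{i}" for i in range(41)],
    "southwest": [f"SW-{i}" for i in range(41)],
}

# Stratified sampling: draw 5 cities at random within each stratum,
# so that every part of the state appears in the 20-city sample.
stratified_sample = {
    region: random.sample(cities, 5) for region, cities in strata.items()
}

total = sum(len(cities) for cities in stratified_sample.values())
print(total)  # 4 strata x 5 cities = 20
```

Because the draw happens within each stratum, no region can be missed by chance, which is the main attraction of stratification over simple random sampling.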

If the sampling of observations involves objects that have geographic references, more variations are needed to accommodate the geographic dimension. The sampling scheme designed to accommodate the sampling of observations in geographic space is called spatial sampling. A good summary is available for further reading in Griffith and Amrhein (1991, p. 215).

In the spatial sampling framework, locations are randomly selected to perform random sampling. When this process is implemented in a computer environment or with a Geographic Information System (GIS), the random locations are usually defined by the x-y coordinates taken from two sets of random numbers, as shown in Figure 1.1a. If the x-coordinates and y-coordinates are randomly determined, the resulting points defined by these x-y pairs are thought to be randomly distributed. In its simplest form, systematic sampling selects regularly spaced locations to ensure complete coverage of the entire study area, such as the structure shown in Figure 1.1b. Note that the distances between adjacent points are kept the same or approximately the same along the x- and y-directions only, not along the diagonals. If one prefers a spatial systematic sampling framework with observations regularly spaced, but with equal distances to their nearest neighbors, then the structure will be a triangular lattice, which resembles a hexagonal structure.
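A minimal sketch of these two spatial sampling frameworks, assuming a square 100 x 100 study area (an illustration in Python, not the book's Figure 1.1 data or its ArcView implementation):

```python
import random

random.seed(7)

# Random spatial sampling (as in Figure 1.1a): x-y coordinates taken
# from two sets of random numbers over a 100 x 100 study area.
random_points = [
    (random.uniform(0, 100), random.uniform(0, 100)) for _ in range(25)
]

# Systematic spatial sampling (as in Figure 1.1b): a regular square
# lattice with equal spacing along the x- and y-directions
# (the spacing along the diagonals is larger).
spacing = 20
grid_points = [
    (x, y)
    for x in range(10, 100, spacing)
    for y in range(10, 100, spacing)
]

print(len(random_points), len(grid_points))
```

Swapping the square lattice for a triangular one would equalize the nearest-neighbor distances, which is the hexagonal-looking structure mentioned above.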

With these two general schemes of spatial sampling, we can create more variations. For example, we can combine random sampling with systematic sampling so that the geographical space is divided systematically but sampling is done randomly within each partitioned region. Of course, the partition of the geographical space should be mutually exclusive and collectively exhaustive. Figure 1.1c combines the systematic and random sampling frameworks by first dividing the entire region into subregions and then randomly selecting a point within each subregion.
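The hybrid scheme of Figure 1.1c can be sketched by partitioning an assumed 100 x 100 study area into a 4 x 4 grid of subregions and drawing one random location per cell (an illustrative sketch, not the book's implementation):

```python
import random

random.seed(3)

# Partition a 100 x 100 study area into 4 x 4 subregions that are
# mutually exclusive and collectively exhaustive, then select one
# random point within each subregion (Figure 1.1c style).
cell = 25
hybrid_points = [
    (random.uniform(x0, x0 + cell), random.uniform(y0, y0 + cell))
    for x0 in range(0, 100, cell)
    for y0 in range(0, 100, cell)
]

print(len(hybrid_points))  # one point per subregion -> 16
```

The systematic partition guarantees even coverage of the study area, while the random draw within each cell avoids the rigid regularity of a pure lattice.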

One final note about spatial sampling is that our sampling unit so far is limited to locations in space, or points. There are, in fact, alternative sampling units. For example, Griffith and Amrhein (1991, p. 215) reviewed two other types of sampling units: linear units or traverses and areas. Sampling by areas will be discussed in Chapter 6, which deals with point pattern analysis. When Quadrat Analysis is used to analyze point patterns, the sampling areal units are known as quadrats.

1.2 WHAT IS SPECIAL ABOUT SPATIAL DATA?

Techniques for statistical analysis are very well developed and are widely used in many research fields and practical applications. However, most statistical techniques and models were developed not for observations with explicit geographic referencing information, but for data compiled by selecting sample observations randomly from the population. When conventional statistical methods are used to analyze data derived from such observations, it is assumed that the observations and associated data are independent. But spatial data gathered from nearby observations within a study region tend to be related to each other, so we cannot assume that the observations are independent. For this reason, using conventional statistical methods to analyze spatial data may cause problems.

(Continues...)

Excerpted from Statistical Analysis of Geographic Information with ArcView GIS And ArcGIS by David W. S. Wong Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.