Note: Eight to ten years ago I was using the term “exceptionally large data” to identify what is today called “Big Data.” (Which for the most part is updated in this version.) Back then I also had to deal with the storage related difficulties experienced and the short life many of these projects and their results had unfortunately due mostly to storage space limits and costs (the post-Zip, pre-flashdrive era). This page and neighboring pages are from older teaching materials used for a lab on GIS and the corresponding lecture/discussion series on ‘GIS, population health surveillance, epidemiology and public health’. Some of the terminology may still be older in nature. This page is still quite valid, but is in need of some updates.

In recent months, the Big Data world has made an abrupt change in the direction being taken by businesses in possession of exceptionally large amounts of data. The first time I knew this change was taking place was back in 2007, when I had the chance to discuss with a data warehouse company what could be done with the information this company possessed. My professionally oriented goals were focused on population health. My contact’s corporate goals and desires were focused on making better use of the company’s medical data. At the time I was already engaged in monitoring the success of education programs designed to improve health care practices, trying to write protocols and new statistical methods for testing physicians and measuring how much their practice changed as a result of their new knowledge. My hope, of course, was to document healthier populations as a result of engaging in the new protocols being taught and learned.

It was also about this time that I was already well aware of the new strategies being developed for producing a GIS based population health monitoring program. During the past few months in 2007, I was involved in working out the methods for grid mapping the health of the entire country, by breaking the country down into smaller, standardized units applicable to small area monitoring of population health features. A number of experiences I had in my previous GIS years were now working in my favor. I had already successfully developed the algorithms for more accurately defining the distribution of certain point-related metrics over time and space, based on a protocol that mathematically was more accurate than the traditional grid mapping programs.

I also had the experience to know the limits of small space or cell areas over much larger areas used to carry out certain research protocols. With this in mind I already suspected the limits to applying grid cell mapping to the entire U.S. I combined some of this thinking with my experiences mapping the nearly twenty year history of chemical release in Oregon, down to the place and chemical type. This data was in turn linked to medical data with the support of a series of grants awarded to me for this GIS-based research project. I learned during this time the value of grid mapping data versus point-mapping data. The accuracy of point data was limited by the GPS placement error for the chemical exposure sites, to which was added the error of the patient location data. This total error determined which grid cell sizes could be used for areal mapping development and analyses.
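The error reasoning above can be made concrete with a small sketch. It assumes, purely for illustration, a GPS placement error for the exposure site and a location error for the patient address (both hypothetical values in meters), adds them as described, and screens candidate grid cell sizes against the total. The rule that a cell should be at least twice the total error is an assumption of this sketch, not a figure taken from the Oregon work.

```python
def total_position_error(site_error_m, patient_error_m):
    """Worst-case combined error: the two placement errors simply add."""
    return site_error_m + patient_error_m


def usable_cell_sizes(cell_sizes_m, site_error_m, patient_error_m, factor=2.0):
    """Keep only cell sizes at least `factor` times the total error.

    `factor` is a hypothetical safety margin chosen for this sketch.
    """
    total = total_position_error(site_error_m, patient_error_m)
    return [size for size in cell_sizes_m if size >= factor * total]


# Hypothetical values: 30 m GPS error at the release site,
# 120 m location error for the patient address.
candidates = [100, 250, 500, 1000, 5000]
print(total_position_error(30, 120))
print(usable_cell_sizes(candidates, 30, 120))
```

Any cell size that fails the screen is one where the combined error could easily move a case into a neighboring cell.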

The typical measurements taken of health care systems and programs review similar features, relating health care opportunities and practices to the improvement of individual and population health, much like people with cancer are spatially correlated to their proximity to cancer-producing risk factors. Even though the underlying causality is different for the two, the spatial relationship evaluated between cause and effect is identical in terms of different spatial features being evaluated. The clustering of specific sicknesses, behavioral differences or problems, living arrangements, accessibility to health care features, all play important roles in the success of the health care program(s) made available to each particular person.

Over the past decade or more, health care programs have evaluated geography mostly as a measure of care provided within the allowable travel distance to care, a metric asked for as part of every HEDIS or NCQA/Quality Assurance measurement program evaluated and reported by each monitored facility and health care coverage program. Individual health states, the ability of a program to manage the needed services, the specific spatial arrangement of specific systems related care management problems, and the coursing effect of specific types of problems in need of more targeted forms of care and interventions are all features that can be more effectively managed, monitored, and then compared with other programs by relying upon GIS and a national population health spatial database. With the addition of the spatial databases provided by Big Data companies, we can for the first time standardize the analysis of national population health data, basing our results and conclusions upon a much more credible and truthful representation of the national population health grid or layout.

Two Systems

Managed Care

Until recently, HEDIS/NCQA or NCQA-like activities have involved processes that were quite separate and different from the Big Data development, storage and utilization programs and products. In part this might be due to an ongoing dilemma regarding HIPAA rules and regulations. The personal privacy of an individual’s health data has always been the reason for these arguments. During many early attempts to employ GIS in managing, monitoring and statistically evaluating health care data, the ability of a third party, perhaps a reader of the final published article, to identify a particular person based upon a point on an area-specific map product has typically been the reason certain groups or agencies refused to support important epidemiological research projects. Back in the late 1990s, this was not as much a limitation to similar studies engaged in by other countries, such as Great Britain. During the early 2000s, studies were allowed in American educational settings, with HIPAA compliant protocols built into the research methods and presentations. In the traditional non-medical institutional setting, HIPAA was as much a concern and issue as it was in the health care research setting, such as the medical education facilities and institutions affiliated with that non-medical setting, with the exception that HIPAA often did not intervene as much in the final production and presentation process as it did within health care facility locations. Between 2000 and 2004, my presentations of Lyme disease data and cancer data did not raise as many concerns within academia, even in the state health department conference setting, as they might have in a facility active solely in the health care/health information world.

In 2006 this concern for committing a HIPAA violation even led many of the best research facilities and settings either to perform research and report its results only internally, or to modify the data, including its mapping, when presenting results outside the HIPAA-compliant professional settings. The irony felt by outsiders about this two-sided approach to allowing spatial research to take place often resulted in disappointments felt mostly by those residing within the overly compliant state and federal agency settings. Each of us on these research teams could tell that the very specific data being presented on a map would be hard to convert to true street data. The common sense realization behind all of this was that even if we knew someone on a particular street, or even at an approximate address, the case being presented might not be that person at all.

One of the first major successful attempts to deal with this address-patient matching concern was to rely upon presentations focused on areal results, such as block group and zip code area outcomes. This ensured that no individual was directly implicated, so long as the condition wasn’t extremely rare. Such a method still suffices, and is used more with zip code areas than census block groups (census blocks are still too revealing). In fact, quite a few articles have been published in recent years on this method of large area analysis and publication of the results in map form. Back in my early years of GISing health, I favored any small area technique of spatial analysis over the use of zip code data. This was due mostly to the greater accuracy of smaller area polygons like census block groups or blocks. But if we take this irregular, large area problem, allow for it but correct for it, or know where it becomes a limiting factor, the remaining parts of an area mapped can be fairly evaluated using zip code data. For the health care field, this means that zip code tracts can probably be used effectively to study population health features, especially during these early years of implementing this process as part of the Big Data movement toward developing population health research methods.

Big Data

The Big Data/Big Area movement or change in population health monitoring programs is a result of the successful application of Big Data to health-related activities and spatial features. In the long run, within the corporate world setting, it seems inevitable that applying Big Data to the overall health industry would be an expected outcome of the development of large corporations in charge of so much data. In such cases, the development of new industries within such a sensitive electronic information defined environment makes it necessary for some sort of monitoring and regulation processes to commence. It is probably easier to manage an ongoing and steadily growing demand involving IT and IM mergers than to make it illegal to engage in such ventures.

Health Care Economics and the Insurance Industry was the natural way for this activity to commence. The development of large companies managing multiple insurance companies and the development of exceptionally large pharmacy benefits management companies made it inevitable for this type of information system management process and related regulatory quality assurance program to be developed. The primary motive for this change was not so much population health related controls, as it was financial loss and corporate failure problems due to fraud and abuse, overbilling, claims error habits, and other forms of improper billing or misappropriated clinical care, medical testing, follow up care and prescribing practices.

These responsibilities set the stage for clinical and public health-based monitoring activities to ensue, so long as the required skill set was there, ready and willing to test this hypothetical use of large claims, billing, clinical, laboratory, diagnostics, and pharmacy data. The primary limitations to applying Big Data generated resources to small program activities pertained to the availability of this data, cost for data access, cost for related services, access to sufficient and reliable resources, the ability of both hardware and software products to produce outcomes with this data, the validity of these resources, and the value of these resources when applied to such uses.

The development of large data warehouses with large-scale information processing set-ups, running in parallel, is what made the evaluation of exceptionally large datasets possible. From about 2007 to 2010, individuals like myself termed this processing exceptionally large data management and data mining analyses. The logic to data mining and exceptionally large data management back then related to the time constraints that exceptionally large data processing steps imposed. On a personal laptop or home PC, GIS spatial data processing could be anything from an hour-long project to a several hour or overnight analytics project. My first state related grid map of Oregon took six hours to produce the centroid data for; even though it would have helped, this process was not engaged in for the identical study performed on the nearby state of California due to state size and time constraints.

Integrative Data Management in Managed Care

We can apply Big Data to Managed Care by developing a program or system that makes use of three important datasets: the study group data, representative local or internal institutional data, and base data, meaning large area regional or national data. Based on past HEDIS evaluations and the NCQA methods of evaluation, we know there are specific regional differences demographically, especially in terms of age-gender and cultural and/or ethnic groups. We can presume, for example, that certain parts of this country tend to have older residents because people tend to move to certain locations following retirement. We can also presume that two of the more important age-gender groups within a given population, the parent and child group members, have distributions that can differ between certain sized age-gender regions. Both HEDIS and NCQA studies have resulted in some redefining of large area regions in the U.S. compared with standard U.S. census definitions of regions. These new regions can be further subdivided, in some cases resulting in statistically significant differences in health and diagnostics measures.
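One way to relate the three datasets is to apply the rates from a base population to the age-gender structure of the study group and compare the expected count with what was observed. The sketch below uses indirect standardization as a stand-in technique; the strata, counts, and rates are hypothetical and are not drawn from any HEDIS or NCQA report.

```python
def expected_cases(study_counts, base_rates):
    """Cases expected if the study group experienced the base rates.

    `study_counts` maps an age-gender stratum to the number of members;
    `base_rates` maps the same strata to a rate per member.
    """
    return sum(n * base_rates[stratum] for stratum, n in study_counts.items())


def observed_to_expected(observed, study_counts, base_rates):
    """Ratio above 1 means more cases than the base population predicts."""
    return observed / expected_cases(study_counts, base_rates)


# Hypothetical study group and two hypothetical base populations.
study = {("F", "0-17"): 200, ("F", "18-64"): 500,
         ("M", "0-17"): 220, ("M", "18-64"): 480}
local_rates = {("F", "0-17"): 0.02, ("F", "18-64"): 0.05,
               ("M", "0-17"): 0.03, ("M", "18-64"): 0.06}
national_rates = {("F", "0-17"): 0.015, ("F", "18-64"): 0.04,
                  ("M", "0-17"): 0.02, ("M", "18-64"): 0.05}

observed = 60
print("vs local base:", observed_to_expected(observed, study, local_rates))
print("vs national base:", observed_to_expected(observed, study, national_rates))
```

Running the same observed count against both the local and the national base shows how much of an apparent difference comes from the choice of base population.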

More health care data integration. There are several avenues to take in order to relate large area data to local managed care data. Health care data integration is the current hot topic in the field of medical data management. The goal of companies devoted to health care integration is to produce an exceptionally large dataset that has applications for use at some regional or national level. These data sets can in turn be applied to small areas, small companies, small sets or classes of companies, and smaller insurance companies.

Better health care data quality. In general, the overall quality of the electronic health care data out there is good. There are examples of places, programs, and people who are behind the times in terms of understanding the proper use of fields on the data entry screen, adding personal notes here and there without thought of how they would be interpreted further down the road. But for the most part, data that is being added to later be integrated with other data is more in need of adequate manpower than major change. The limiter in the use of health care data is related more to its format and availability, and to the need to establish or better follow database development guidelines.

New health care data analysts. We need new analysts who can engage in new thinking for a new field with new potentials, not old time thinkers trying to resynthesize an old knowledge base to create new thoughts. A systems approach to reviewing the system itself might help. There are tests for innovativeness that companies can perform that pertain to how much creativity and exploration are allowed within a business setting, and how much interactivity takes place amongst the most creative workers. Some companies do put up walls whenever and wherever new ideas come about. Others believe that only their methods are best, that there is only one way to generate change or produce a certain set of results. New analysts are new because they don’t follow the same routine. They deconstruct and reshape progress rather than rebuild, again and again, the slow means for progress from the past.

Problems and Limitations

Size differences. Size differences are the primary factor impacting how the statistics behave for different sized population groups being compared. With size differences and size change, two behaviors are worth noting when designing base level populations to compare the original population with.

The first issue relates to matchability. One of the older methods of comparing small groups being studied, such as with comparisons of pharmaceutical activities and success, involves the process of developing paired datasets. This method is applied mostly to very small population studies, by the tens, hundreds, thousands and sometimes tens of thousands of patients or cases, for comparison of outcomes for two separate research programs, such as a Control group versus a Non-control group. With populations numbering in the tens of thousands or more, such pairing has not been routine in the past, since the programming for this methodology was time consuming. Now it is possible in theory to develop a new model for engaging in such a process. The problem now is that the most basic assumption of this process, that each pair is truly matched, becomes increasingly likely to be violated by human error in the data processing routines. It is easier to check the conformity of 2,000 matched pairs than 20,000 matched pairs. With the latter we tend to opt for automated methods, in which we don’t make a visual inspection of the data, increasing the likelihood of a poor match.
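When visual inspection is no longer practical, an automated audit can at least confirm that every pair agrees on the matching fields. The following is a minimal sketch, assuming each record is a dictionary and that pairs are matched exactly on age band, gender, and region; these field names and the greedy matching rule are hypothetical choices for illustration, not a description of any particular matching program.

```python
from collections import defaultdict

MATCH_KEYS = ("age_band", "gender", "region")  # hypothetical matching fields


def match_pairs(cases, controls, keys=MATCH_KEYS):
    """Greedy exact matching: pair each case with one unused control
    that shares the same value for every key."""
    pool = defaultdict(list)
    for control in controls:
        pool[tuple(control[k] for k in keys)].append(control)

    pairs, unmatched = [], []
    for case in cases:
        bucket = pool[tuple(case[k] for k in keys)]
        if bucket:
            pairs.append((case, bucket.pop()))
        else:
            unmatched.append(case)
    return pairs, unmatched


def audit_pairs(pairs, keys=MATCH_KEYS):
    """Return every pair that disagrees on any matching key."""
    return [(case, control) for case, control in pairs
            if any(case[k] != control[k] for k in keys)]


cases = [{"id": 1, "age_band": "40-44", "gender": "F", "region": "NW"},
         {"id": 2, "age_band": "65-69", "gender": "M", "region": "NW"}]
controls = [{"id": 9, "age_band": "40-44", "gender": "F", "region": "NW"}]

pairs, unmatched = match_pairs(cases, controls)
print(len(pairs), "pairs,", len(unmatched), "unmatched,",
      len(audit_pairs(pairs)), "failed audit")
```

The audit step is deliberately separate from the matching step, so it can also be run on pairs produced by some other program or by hand.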

We can also engage in grouped matched pairing routines, making each group somewhat comparable to its counterpart in the other population. At the large scale population level, the problems with the prior example are duplicated and made more likely, due to the simple fact that increasing numbers increase the likelihood that the selection process will produce a poor outcome.

The best way to avoid selection generated error is to not make use of the selection process at all, except for the following circumstances:

Region, beginning with census or NCQA defined regions, but with the possibility of applying an NCQA subregionalization method based upon outcomes for sub-region comparisons. The following subregionalization methods were shown to be most effective in exceptionally large databases.

[]

Major cultural subgroups. Cultural grouping becomes important when the studies being performed were designed to involve some sort of ethnic, cultural, or socioeconomic-based analysis. The standard national datasets provide this information in the form of a race/ethnicity coding system applied to census data. Similar methods are already applied to many insurance program datasets, but this data can be a source of problems, often for the following reasons.

Interpreting data is highly humanistic, phenomenological, and opinionated. Ethnicity/race does have a certain amount of human, cultural value system related problems attached to its use. What one considers himself or herself to be and what the system states each individual qualifies as are two very different outcomes. I can ask myself how healthy I am at the moment and give myself a score of 7 out of 10. But once the computer is asked to define my healthiness, and reviews all of my activities including my doctor visits, lab results, notes recorded on exercise activity, most recent dietary patterns, events engaged in during my days off, my mother and father’s health, and their overall family history of disease patterns entered into the database as heredity data and ICDs, the computer might say I only qualify as a 4 or 5 in terms of healthy lifestyle.

We cannot reasonably expect an individual to know his or her heritage or genetically linked cultural medical history, and so there is the problem of not being able to completely apply this metric across all possible research avenues. In the case of ethnicity related or even genetically-linked medical diagnoses, race and ethnicity may be an unknown for certain people with a positive diagnosis, making this application of such a metric unreliable. In other situations, a person may believe he or she is of a certain racial heritage and yet, for any of a number of reasons, not be so biologically. Finally, ethnicity is not always linked to complexion and the anthropometric features often relied upon to “guess” race or ethnicity. One race is easily identifiable as such; many others aren’t. There are some races where other factors play a more important role in overall health. For some African cultures, tribal status prevails as the most dominant physical health related feature, not country of descent or place of birth.

Due to this humanistic effect of race and ethnicity on many potential health metrics, we must remain cautious but willing to actively apply new statistical research methods to evaluating certain physical health related conditions in terms of race and ethnicity linked social inequality related health practices taking place clinically. We can also cautiously relate such a metric to health related behaviors and conditions. The truest outcomes will not necessarily come from patient generated responses to race and ethnicity survey questions, but are more likely to come from well defined human genetics based testing methods for quantifying an individual’s physical make up.

The opposite is true for culturally-bound syndromes in relation to perceived heritage and chosen lifestyle status. Culture can be used to quantify culturally-bound syndromes more accurately than normally measured for studies of occupational health. The other metrics available can help us to further develop such metrics and improve upon their use in evaluating some of the most sensitive and culturally important health indicators for a given cultural group. Culturally-linked conditions can also be related to this method of population health analyses and summaries, and should be engaged in as part of a well-defined culturally-specific or targeted form of special population health analyses metrics. Such a process is defined in detail elsewhere at this site.

To make Big Data of use to managed care analysis, it has to be related to the managed care epidemiological and surveillance work. There are a number of features of data in general that relate to how we can link Big Data to a local or regional health care monitoring system. The homogeneity of data and its heteroscedasticity are the two biggest factors that make such methods succeed or fail. To date, the overall knowledge base out there about the impacts of each of these on outcomes reports has been extremely poor in the corporate world. There is a difference in the statistical reliability, validity and credibility of business reporting in the refereed and non-refereed trade journals and magazines, as well as in the internal corporate documents we use to base many of our financial decisions upon. In general, the corporate world doesn’t understand the difference between descriptive analysis and statistical significance analysis, and does not usually make any good use of standard deviations, CIs, and p values. These are reported, but typically ignored when numbers such as cost are reported. A cost value or rate of failure or success reported without them is considered interesting, maybe even reliable, but in the end is actually worthless and meaningless, and unfortunately will still be used to make important financial decisions at the reporting level and attempted cost-savings level. There are few leaders guiding the ship in such a setting, where important and costly decisions are being made. All of this is due to the misunderstanding of numbers and the misapplication or lack of use of the concepts of homogeneity and heteroscedasticity. You can’t rely upon one without paying very close attention to the second, and knowing what it means.
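The gap between a descriptive number and a significance analysis is easy to show with a small sketch. The cost figures below are hypothetical, and the confidence interval and p value use a large-sample normal approximation, which is an assumption of this example; a small sample would call for a t-based method instead.

```python
import math
import statistics


def summarize(costs, z=1.96):
    """Mean, sample standard deviation, and an approximate 95% CI.

    Uses a normal approximation (z = 1.96), reasonable only for
    fairly large samples.
    """
    mean = statistics.mean(costs)
    sd = statistics.stdev(costs)
    half_width = z * sd / math.sqrt(len(costs))
    return mean, sd, (mean - half_width, mean + half_width)


def two_sample_p_value(a, b):
    """Two-sided p value for a difference in means (unequal variances),
    again using the normal approximation."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - statistics.NormalDist().cdf(abs(z)))


# Hypothetical annual costs per member for two programs.
program_a = [900, 1100, 950, 4000, 1000, 1050, 980, 1020]
program_b = [1200, 1250, 1180, 1300, 1220, 1260, 1240, 1210]

for name, costs in (("A", program_a), ("B", program_b)):
    mean, sd, ci = summarize(costs)
    print(name, round(mean), round(sd), [round(x) for x in ci])
print("p value:", two_sample_p_value(program_a, program_b))
```

Reporting only the two means would hide the fact that one program's figure is driven by a single expensive member, which is exactly what the standard deviation and interval reveal.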

Matching homogeneity. Homogeneity is the most important part of evaluating population health and then using your data for comparisons with a known group or control dataset. The major problem researchers are most apt to face with their various routines relates to the ability of a given population to be compared with a given control, base population, or automatically generated base population equivalent. The single most important feature defining a base population is its degree of difference from the sample population or actively covered, insured population in need of this review. Size is not the issue as much as each of the following demographic features: age, gender, ethnicity or cultural linkage(s), and place or region. Adding income data and a health risk score to this measurement process not only enhances the validity and reliability of the entire research methodology being put into place, it also adds to the power of this population health monitoring program or tool.

In one method of analysis I developed, homogeneity is measured at its strongest peak range using a standardized algorithm I developed that enables two different population pyramids to be effectively compared with each other. The major limitation of these comparisons is homogeneity, which can be quantified by using a formula that measures differences between two populations based upon the above metrics, followed by non-log and logarithmic methods of comparing the two population pyramids. The degree of difference can be evaluated to determine how similar each population is to the other, and how similar its most important subgroups are, mainly the income range, health range, and culture-ethnicity subgroups.
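The formula itself is not reproduced on this page, so the sketch below swaps in two well-known stand-ins to show the general idea: an index of dissimilarity computed on the share of each age-gender cell (the non-log comparison), and the log of the share ratio per cell (the logarithmic comparison). The pyramids are hypothetical, and none of this should be read as the algorithm described above.

```python
import math


def to_shares(pyramid):
    """Convert cell counts into each cell's share of the total population."""
    total = sum(pyramid.values())
    return {cell: count / total for cell, count in pyramid.items()}


def dissimilarity_index(pyramid_a, pyramid_b):
    """Half the sum of absolute share differences.

    0 means identical age-gender structure; 1 means no overlap at all.
    """
    a, b = to_shares(pyramid_a), to_shares(pyramid_b)
    cells = set(a) | set(b)
    return 0.5 * sum(abs(a.get(c, 0.0) - b.get(c, 0.0)) for c in cells)


def log_share_ratios(pyramid_a, pyramid_b):
    """Natural log of the share ratio for cells populated in both pyramids."""
    a, b = to_shares(pyramid_a), to_shares(pyramid_b)
    return {c: math.log(a[c] / b[c])
            for c in a if c in b and a[c] > 0 and b[c] > 0}


# Hypothetical pyramids: counts per (gender, age band) cell.
insured = {("F", "0-17"): 900, ("F", "18-64"): 3100, ("F", "65+"): 400,
           ("M", "0-17"): 950, ("M", "18-64"): 3000, ("M", "65+"): 350}
regional = {("F", "0-17"): 110000, ("F", "18-64"): 300000, ("F", "65+"): 90000,
            ("M", "0-17"): 115000, ("M", "18-64"): 295000, ("M", "65+"): 70000}

print(dissimilarity_index(insured, regional))
print(log_share_ratios(insured, regional))
```

Because both measures work on shares rather than counts, a small insured group can be compared with a far larger regional population, which fits the point that size matters less than composition.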


Heteroscedasticity. Treating income as a source of heteroscedasticity is an important step in evaluating group differences within a population health study. Even though the source for health care coverage is identical between people, income differences affect such measures as average costs, costs for prescriptions, forms of health care delivery, the kinds of treatment and care an individual goes through, and the amount of care or length of time a particular person is covered and/or has to pay out to receive adequate if not best care. Heteroscedasticity as a rule does not exist in the other health care insurance groups. This is because of the inherent requirements for health care coverage by these programs, namely Medicare, Medicaid, CHP, homeless, and related low income family or member programs.

This heteroscedasticity makes it a requirement that population comparisons be carried out using some form of algorithm designed to correct for this interfamily difference, common to employee health and even high cost retired employee health insurance programs. The best way to analyze heteroscedastic populations with exceptionally high health care cost members is to manage these high cost members, and sometimes families, as their own unique population health group.
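A rough first screen for the income-related spread described above is to compare the variance of costs within each income band. The sketch below assumes records arrive as (income band, cost) pairs with hypothetical band labels and values; the ratio of largest to smallest variance is only a descriptive screen, not a formal test of heteroscedasticity.

```python
import math
import statistics
from collections import defaultdict


def variance_by_band(records):
    """Sample variance of cost within each income band.

    `records` is an iterable of (income_band, cost) tuples. Bands with
    fewer than two members are skipped because variance is undefined.
    """
    groups = defaultdict(list)
    for band, cost in records:
        groups[band].append(cost)
    return {band: statistics.variance(costs)
            for band, costs in groups.items() if len(costs) > 1}


def variance_ratio(variances):
    """Largest band variance divided by the smallest.

    A ratio near 1 suggests similar spread across bands; a large ratio
    suggests the bands should not be pooled without correction.
    """
    smallest = min(variances.values())
    if smallest == 0:
        return math.inf
    return max(variances.values()) / smallest


# Hypothetical annual costs by income band.
records = [("low", 1000), ("low", 1100), ("low", 1050),
           ("middle", 900), ("middle", 2500), ("middle", 1400),
           ("high", 800), ("high", 9000), ("high", 3000)]

bands = variance_by_band(records)
print(bands)
print(variance_ratio(bands))
```

Rerunning the same screen after setting the highest cost members aside as their own group shows how much of the spread they alone account for.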

Within any small health care program, there are going to be a few of these people for the numerous types of high cost care services out there. As with other major classifications used to distinguish one set of people from the others, Medicaid, Medicare, and related insurance programs have to be differentiated from employee coverage, even though in terms of overall costs the two at first might appear to be very similar groups. If a comparison can be made between the employed and generally unemployed, low income sectors in terms of high cost special care, then such a study should be carried out. Each of these study groups should then be compared with the regional or national population health statistics outcomes.

These high cost patients also have to be broken down into acute versus chronic related causes for high cost. This is done to make this data analysis applicable to predictive modeling measures. Examples of high cost comparisons involve high pharmaceutical cost patient sets, high health care cost patient sets, and combined high pharmaceutical-health care cost patient sets. The first of the three is a group more apt to undergo cost-containment activities and even effective cost-reduction processes due to the introduction of new generic forms of previously expensive medications and the like. The second group normally represents examples of care that will usually increase in cost over time, in spite of overall progress being made. In the best of circumstances, cost reductions ensue due to the discovery of a low cost, effective therapeutic procedure. More often, it is expected that such new procedures may reduce costs in terms of the form of palliative care delivered, but still experience increases in price for all the typical reasons: the cost to become educated in the new practice, the cost to perform it, and the cost of the equipment required, each accompanied by expected rises in care costs due to the typical causes such as demand, the cost of providing such coverage, and inflation.
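The breakdown above amounts to a two-way classification: which cost set a patient falls into, and whether the driving cause is acute or chronic. A minimal sketch follows, assuming hypothetical field names and dollar thresholds; real cut-offs would have to come from the program's own cost distribution.

```python
from collections import Counter


def classify_high_cost(patient, rx_threshold=20_000, care_threshold=50_000):
    """Return (cost_set, onset) for a high cost patient, else None.

    Thresholds and field names are hypothetical. `onset` is expected
    to be "acute" or "chronic".
    """
    high_rx = patient["rx_cost"] >= rx_threshold
    high_care = patient["care_cost"] >= care_threshold
    if high_rx and high_care:
        cost_set = "combined"
    elif high_rx:
        cost_set = "pharmaceutical"
    elif high_care:
        cost_set = "health care"
    else:
        return None
    return cost_set, patient["onset"]


patients = [
    {"id": 1, "rx_cost": 32_000, "care_cost": 4_000, "onset": "chronic"},
    {"id": 2, "rx_cost": 1_500, "care_cost": 88_000, "onset": "acute"},
    {"id": 3, "rx_cost": 25_000, "care_cost": 61_000, "onset": "chronic"},
    {"id": 4, "rx_cost": 300, "care_cost": 2_000, "onset": "acute"},
]

labels = [classify_high_cost(p) for p in patients]
print(Counter(label for label in labels if label is not None))
```

Counting patients per (cost set, onset) cell gives the small table that a predictive model, or a year-over-year comparison, would start from.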

Heteroscedasticity is primarily a private insurer/employee health related cost for care issue. But in the high cost, low income medical care environment, these cost differences between highly expensive illnesses and others of lesser cost generate similar stresses upon the overall system. The group that probably suffers the most from this dilemma is the medium income range group, who lack the additional financial resources needed to make care affordable in spite of high costs. This means that high cost care has the potential of producing a significant burden on the overall system over time. Such burdens can be predicted using the high cost ICDs data, and determining the demographic and spatial requirements for this very unique metric. This particular topic will be evaluated separately at a later time.

The Benefits of Merging the Two Technologies

The role of implementing effective programs using Big Data should be to produce effective base population data for producing more accurate large population health analyses. The availability of Big Data will also enable us to research rare items never before researched due to the scarcity of data for rare events. For this reason, the implementation of programs that produce standardized reports on Big Data population health statistics would be valuable to the health care system in general, and be very helpful in terms of defining value and efficacy for programs already in place. Fine tuning our Big Data down to the one-year age level enables us to avoid wasting time sending out intervention mailings to too many people at too young or too old an age. Fine tuning Big Data enables new tactics to be used to contrast and compare related regions, using a GIS to define these regions. With Big Data covering larger numbers of patients or people, mapping the results of this medical data at the national level is for the first time possible using isoline approaches rather than coarse areal polygon illustrations that provide only range data.

As Big Data takes off in the information world, it is up to the business and medical world to catch up with this technology in general. Medical and public health industries have to catch up with GIS technology. These new forms of data utilization and analyses will allow us to make better use of the rare opportunity we now have to explore such a large amount of information for the first time. To those of us adept in these skills and the necessary knowledge base, this means we’ll be able to find answers to questions that could never be answered before, except in some virtual sense relying upon theory, statistical guesswork, and probabilities. Proof outshines theory, no matter what the argument may be.

Data mining and analysis for the public health field takes a lot of the guesswork out of the epidemiological statistics we have habitually relied upon for more than a half century. It allows us to answer those long lived questions we have had about rare diseases, people’s attitudes and behaviors about health, those unusual medical histories and behaviors that we all felt were impossible to exist. With GIS, we can attach place and therefore meaning to some of these answers for the first time as well. This means that any company avoiding this movement to merge Big Data opportunities with GIS statistical analysis opportunities is on the road to failure. Their competitors who take on such challenges are more than likely going to succeed over the next few years, due not only to innovation, but more importantly common sense. Big Data and Big Analysis go hand in hand, not Big Data and a recitation of much of what is already known, in the form of lengthier tables with bigger charts, graphs, and dollar values. New data elements, applications and results have to be explored and reported, not old data elements rehashed over and over without resulting in much change. If much of the new Big Data we have available to us remains underutilized by the health care industry, the potential for rapid progress will simply disappear.

As an individual trained and motivated in the better use of GIS for mapping out the entire health care industry, I know that changes can be made in the field. I don’t sense businesses have the background needed to produce such changes for now. Their desire to re-synthesize and shuffle around old office paper gets in the way of their potentials for creativity resulting in true change.

The following pages in this section review GIS applications and their use in improving upon the small area data analyses that make up much of the Quality Improvement world in health care. An ideal outcome of implementing this very local use of GIS to study public health would be the ability to relate these outcomes to the same outcomes generated by Big Data analysts and mappers. The method I have explored elsewhere, using a standardized grid approach to exploring small area medical data for the purpose of large area national population health data mapping, is one way to standardize this analysis and reporting method across the board. We have two other such methods of standardized reporting already in place: counties and zip codes. Neither of these allows for small area analysis with the goal of improving upon prevention and control. For such an outcome to be generated, the small area grid mapping technique is the way to go.
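The core step of any standardized grid approach is assigning each case location to a cell and counting cases per cell. The sketch below is a simplified illustration, not the grid method explored elsewhere on this site: it assumes square cells measured in degrees of latitude and longitude, with a hypothetical cell size, and ignores the fact that such cells cover less ground area farther from the equator.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=0.25):
    """Return the (row, col) index of the cell containing a point.

    `cell_deg` is a hypothetical cell size in degrees. Using floor
    keeps negative longitudes in consistently sized cells.
    """
    return math.floor(lat / cell_deg), math.floor(lon / cell_deg)


def count_by_cell(points, cell_deg=0.25):
    """Count case locations per grid cell."""
    return Counter(grid_cell(lat, lon, cell_deg) for lat, lon in points)


# Hypothetical case locations as (latitude, longitude).
cases = [(45.52, -122.68), (45.60, -122.90), (44.05, -123.09), (45.51, -122.66)]

for cell, n in sorted(count_by_cell(cases, cell_deg=1.0).items()):
    print(cell, n)
```

Because every program using the same cell size and origin produces the same cell indices, local counts can be lined up directly against regional or national counts for the same cells.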

Where to go from here

Big Data provides the unique opportunity to analyze age, gender, ethnicity and race (AGER) features spatially. The AGER approach can be applied to rare genetic or race/ethnicity related physical and human behavior/mental health conditions. A standard run in Big Data analysis is the generation of profiles, outcomes, graphs and/or tables, with comparisons between key groups (e.g. White vs. Other, Black vs. Other, etc.) and between ethnicities (Hispanic vs. not; Latino vs. the rest of Hispanics; Japanese or Chinese vs. the rest, etc.). In the United States, NCQA has developed a number of race-ethnicity studies, which take into account only chronic disease management and standard disease prevention methods, for system performance between races (e.g. breast and cervical cancer screening completions for Black vs. others or Hispanic vs. others, or diabetes outcomes for the quality improvement/meaningful use measures in Black vs. White).

The following pages are highly recommended for any managed care organization that has one or more AGER groups being denied full, extensive health coverage.