African horse sickness (AHS) is a disease that is endemic to sub-Saharan Africa and is caused by a virus potentially transmitted by a number of Culicoides species (Diptera: Ceratopogonidae) including Culicoides imicola and Culicoides bolitinos. The strong association between outbreaks of AHS and the occurrence in abundance of these two Culicoides species has enabled researchers to develop models to predict potential outbreaks. A weakness of current models is their inability to determine the relationships that occur amongst the large number of variables potentially influencing the population density of the Culicoides species. It is this limitation that prompted the development of a predictive model with the capacity to make such determinations. The model proposed here combines a geographic information system (GIS) with an artificial neural network (ANN). The overall accuracy of the ANN model is 83%, which is similar to that of other stand-alone GIS models. Our predictive model is made accessible to a wide range of practitioners by the accompanying C. imicola and C. bolitinos distribution maps, which facilitate the visualisation of the model's predictions. The model also demonstrates how ANN can assist GIS in decision-making, especially where the data sets incorporate uncertainty or where the relationships between the variables are not yet known.

Introduction

Geographic information system (GIS) models were first applied in veterinary science in the late 1960s when a GIS model was used to better understand the spread of foot-and-mouth disease in England.1 Since then, the application of GIS models in veterinary science has grown rapidly and currently extends to phenomena such as disease monitoring,2 biological risk management,3 scenario planning4 and animal health surveillance.1 A number of GIS models have also been developed to predict the occurrence in abundance of various species of insects including Culicoides spp. (Diptera: Ceratopogonidae), the vectors responsible for the transmission of the viruses that cause African horse sickness (AHS), bluetongue, epizootic haemorrhagic disease and equine encephalosis.5,6,7 Whilst GIS models have had some success in predicting the abundance of Culicoides in South Africa5 and Europe,6 these models have largely failed to determine the exact nature of the relationships amongst the large number of variables that influence the occurrence of these vectors. This failure results from the complexity of the systems studied, as well as from the large number of variables typically used to predict the presence of the vectors. A potential solution to the problem of a large number of predictor variables and a complicated context lies in artificial neural networks (ANN).

The science of GIS is continually evolving and currently integrates and involves a wide array of subject areas. It is therefore difficult to succinctly define GIS and there currently exists no all-encompassing definition. With particular reference to this study, however, Davis8 defines a GIS as: 'a computer-based technology and methodology for collecting, managing, analysing, modelling and presenting geographic data for a wide range of applications'. Recent advances in computer technology allow decision-makers to deal with challenges of increasing complexity. Because these challenges involve many uncertainties, there is a limit to how well the effectiveness of these methods of analysis can be measured. This is also true in the case of GIS, a key support tool in decision-making.9 It is generally accepted that the effectiveness of a GIS model in decision-making is dependent on the appropriate integration of its key components - people, data, software, hardware and tools/analysis.10 Because these five components are influenced by the application area, Van Helden argues that the application area should be added as a sixth component (Van Helden P 2005, personal communication, January 20). Over the past few decades the focus of GIS models has been on providing knowledge and understanding of spatial data, whilst their significance as a decision-making tool has often been overlooked.11 GIS combined with artificial intelligence can make a valuable contribution to decision-making, especially with recent advances in software that allow artificial intelligence to run on desktop computers, thus making it more accessible.

ANNs are a type of artificial intelligence based on how the human brain functions12 and have the ability to derive meaning from complicated and imprecise data.11,13 ANNs incorporate two important characteristics of the human brain: the ability to learn through examples and the ability to interpolate from incomplete information.14 Thus, ANNs can model extremely complex features and have emerged as an important tool for classification, a promising alternative to conventional classifiers.15 The actual network consists of neurons, called processors (nodes), which are connected by weighted links.14 The basic elements of an ANN consist of a number of inputs, from the original data set or from the output of other neurons, linked to a neuron via weighted links. Each neuron has a transfer function which, together with the weights, determines output. These basic elements (neurons, their inputs and outputs) are arranged to form a network. The most generalised network type consists of three separate layers: an input layer, a hidden layer and an output layer. Input to the network consists of raw data linked to the input layer, whose neurons connect with neurons in the hidden layer. The hidden neurons connect with the output neurons of the output layer, with each link having an associated weight.13 The output neurons of the output layer connect to a known output.16 The ANN is then trained using the input data and the known output to make predictions for unknown cases.
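The layered structure described above can be illustrated with a minimal Python sketch of a single forward pass through a three-layer network. The logistic transfer function, the layer sizes and all weights and biases are illustrative assumptions and are not the configuration used in this study.

```python
import math

def sigmoid(x):
    """Logistic transfer function (an assumed choice for this sketch)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(inputs, weights, biases):
    """Each neuron sums its weighted inputs, adds a bias and applies the transfer function."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer_output(inputs, hidden_w, hidden_b)
    return layer_output(hidden, output_w, output_b)

# Hypothetical network: 3 inputs, 2 hidden neurons, 1 output neuron.
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[1.2, -0.7]]
output_b = [0.05]

prediction = forward([0.6, 0.1, 0.9], hidden_w, hidden_b, output_w, output_b)
```

Training, which is not shown in this sketch, consists of adjusting the weights so that the output for the known cases approaches the known output.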

ANN models cannot solve all problems. They are best suited to situations where the different predictor variables are known but the exact nature of the relationship between them is unknown. ANN models are most appropriate in cases where the relationship between the different variables requires a complex, as yet undeveloped, mathematical model.17 ANN models have the added capability to extract patterns and trends from data sets too complicated for the human brain to recognise or for conventional computers to calculate.13 An additional benefit of ANN models is their capability to accommodate uncertainty or noise in their data sets.18

In this study we used a GIS model combined with an ANN model to predict the potential abundance of Culicoides spp., the insect vectors of the AHS virus. To our knowledge this is the first application of ANN models to predict the abundance of Culicoides spp. in South Africa or elsewhere. Our aim was to develop a model that predicts the abundance of Culicoides imicola and Culicoides bolitinos, the carriers of the AHS virus. This model specifically focuses on these two vectors because of their abundance near livestock, their host preference and their susceptibility to infectious diseases.7 Because factors such as climate, soil type, presence of water bodies, livestock density and irrigation5,6 all influence the occurrence in abundance of C. imicola and C. bolitinos, they were incorporated in the predictive model. In developing this model we also intended to illustrate how ANNs can assist GIS in developing decision support systems in which there are numerous uncertainties regarding the relationships that exist amongst the predictor variables.

Methods

Study area

This case study focused on the occurrence in abundance of C. imicola and C. bolitinos in the Western Cape province of South Africa. Until recently all provinces in South Africa were considered to be AHS-positive zones with outbreaks occurring frequently in the north-east of the country. This has led to legislation restricting the movement of horses both in and out of the country, as well as to restrictions on the hosting of international equestrian events. The economic consequences of these restrictions are dire. In an attempt to aid in the exportation and movement of horses, an AHS-free zone was designated around Cape Town in the Western Cape. The location of the AHS-free zone was based on the historical absence of AHS in the area despite the vector species occurring naturally in the region.

Data requirements

The following data sets were hypothesised to influence the occurrence in abundance of C. imicola and C. bolitinos in South Africa and were obtained for the period from December 2005 to December 2006.

Climate data

Daily weather data were obtained from the South African Weather Service and the Agricultural Research Council (ARC). The following variables were calculated from the daily data as an average for each month of the year for the study period: total, average, maximum and minimum rainfall; maximum, minimum and average of both the minimum and maximum temperatures; and maximum, minimum and average humidity. A 1-km raster surface was calculated for each of these variables using spatial interpolation.19

Long-term monthly minimum and maximum temperatures and rainfall were used to calculate anomalies or deviations from long-term averages. The long-term data were calculated by the ARC over a 20-year period.
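The monthly aggregation and the anomaly calculation described above can be sketched as follows. This is a minimal illustration only: the function names and the sample values are hypothetical, and the anomaly is assumed to be the simple difference between the observed monthly value and the long-term value for the same month.

```python
from statistics import mean

def monthly_summary(daily_values):
    """Summarise one month of daily observations for one variable at one station."""
    return {
        "total": sum(daily_values),
        "average": mean(daily_values),
        "maximum": max(daily_values),
        "minimum": min(daily_values),
    }

def anomaly(observed_monthly, long_term_monthly):
    """Deviation of the observed monthly value from the long-term average (assumed definition)."""
    return observed_monthly - long_term_monthly

# Hypothetical daily rainfall values (mm) for a very short 'month'.
rain = monthly_summary([0.0, 5.0, 10.0])

# Hypothetical monthly maximum temperature (22.5) against a long-term value (20.0).
temperature_anomaly = anomaly(22.5, 20.0)
```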

Distribution of Culicoides imicola and Culicoides bolitinos

Total daily counts of C. imicola and C. bolitinos were obtained from the Entomology Division of the ARC-Onderstepoort Veterinary Institute. These were obtained as global positioning system (GPS) points which were subsequently imported and displayed in a GIS. These point samples were taken during the night by setting up black light traps which were placed near livestock stables (Venter GJ 2007, personal communication, February 15). As a result of the large number of environmental factors that can influence light trap efficiency, the number of Culicoides collected at each light trap should not be viewed as an accurate count of absolute numbers but rather as an indication of relative abundance (Venter GJ 2007, personal communication, February 15). The daily counts for C. imicola and C. bolitinos for most months were incomplete whilst no counts were recorded for June, July and October 2006. Whilst this does place some restrictions on the method, this was the most spatially replete (and available) data set with which to build the model and make inferences.

Monthly average and monthly total counts were calculated for each species individually and for both species combined. The combined monthly average result was used in the development of the model. A combined C. imicola and C. bolitinos count where the monthly average is greater than 1000 is regarded as an abundant population density (Venter GJ 2007, personal communication, February 15).
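The classification rule above can be sketched in Python as follows. The threshold of 1000 comes from the text; how the two species' daily counts were combined before averaging is not stated, so adding them per trap night is an assumption made for this sketch, and the counts shown are hypothetical.

```python
ABUNDANCE_THRESHOLD = 1000  # combined monthly average above which a population is regarded as abundant

def combined_monthly_average(daily_imicola, daily_bolitinos):
    """Average of the combined daily counts over the nights on which a trap was run.

    Assumption: the two species' counts are summed per trap night before averaging.
    """
    combined = [a + b for a, b in zip(daily_imicola, daily_bolitinos)]
    return sum(combined) / len(combined)

def is_abundant(monthly_average):
    """True when the combined monthly average exceeds the threshold."""
    return monthly_average > ABUNDANCE_THRESHOLD

# Hypothetical counts for two trap nights in one month.
average = combined_monthly_average([800, 1200], [100, 300])
abundant = is_abundant(average)
```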

Clay areas and water bodies

The locations of potential breeding sites for C. imicola6 - clay areas and water bodies - were obtained in electronic format from the Environmental Potential Atlas released by the Department of Environmental Affairs and Tourism. Because Culicoides spp. can easily spread as far as 2 km from their breeding sites,7 a 2-km buffer zone was created around all water bodies and clay areas and converted to a 1-km raster layer. Although Culicoides spp. can spread up to 700 km in windy conditions, the effect of wind was not incorporated into this model.

Normalised difference vegetation index and land surface temperatures

Land surface temperatures (LSTs) and normalised difference vegetation index (NDVI) data were obtained as 1-km grid raster images from the Moderate Resolution Imaging Spectroradiometer (Modis) website (http://modis.gsfc.nasa.gov/). NDVIs were obtained as monthly averages for the time period covered, whilst images with the lowest possible cloud cover per month for the LSTs were used as raster layers in the GIS.

Altitude

Altitude plays a significant role in the geographic distribution of C. imicola and C. bolitinos.5,6 To cater for this, a 1-km digital terrain model of South Africa (obtained from the Department of Geography, Geoinformatics and Meteorology, University of Pretoria) was used.

Livestock and field boundaries

The geographic distribution of livestock per magisterial district - indicating the total number of cattle, sheep, poultry and horses in a magisterial district - was obtained from the Directorate: Animal Health of the Department of Agriculture. The Western Cape province consists of 41 magisterial districts of sizes ranging from 16 580 km2 (Beaufort West) to 70 km2 (Goodwood). Livestock density for each magisterial district was calculated by dividing the total animal population for the district by its area. These values were then converted to a 1-km raster layer for use in the GIS. This 1-km raster layer is significant because C. bolitinos breed in cattle dung and both species are known to feed on large mammals.7 Wild animals also influence the abundance of C. imicola and C. bolitinos, but these data were not available.7
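The density calculation is a simple division, sketched below. The district areas are those given in the text; the animal totals are hypothetical and serve only to show the arithmetic.

```python
def livestock_density(total_animals, area_km2):
    """Animals per square kilometre for one magisterial district."""
    if area_km2 <= 0:
        raise ValueError("district area must be positive")
    return total_animals / area_km2

# Areas from the text (km2); animal totals are hypothetical.
beaufort_west = livestock_density(1_658_000, 16_580)
goodwood = livestock_density(1_400, 70)
```

Because a single density value is assigned to an entire district, very large districts such as Beaufort West are represented much more coarsely than small ones in the resulting 1-km raster layer.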

Field boundaries were obtained from the South African Department of Agriculture. No information regarding farming methods was available, so, for the purpose of this study, all cultivated fields were assumed to be irrigated. A 2-km buffer zone was assigned around all irrigated fields because C. imicola and C. bolitinos can easily spread 2 km beyond their breeding sites. These buffer zones were converted to a 1-km raster layer and imported into the GIS for further analysis.

Reported outbreaks of African horse sickness

The reported outbreaks of AHS were obtained from the Department of Agriculture, Chief Directorate: Food and Veterinary Services. In South Africa, AHS is listed as a controlled and notifiable animal disease and an outbreak of AHS must be reported to the abovementioned Chief Directorate. The outbreaks are recorded per farm but published per veterinary district.20 Because there is a strong association between outbreaks of AHS and abundance of the two Culicoides species, the predicted abundance was mapped against the actual outbreaks of AHS.7 Classification maps of the outbreaks per month for the time period were created to serve as a backdrop for the display of the predicted abundance of the Culicoides species.

The data sets described above were combined into a GIS model and stored per month from December 2005 to December 2006. As climate has a delayed effect of 15 to 30 days on the population growth of Culicoides,7 species counts for a specific month were combined with the NDVI and climate data for the previous month (Venter GJ 2007, personal communication, February 15). After incorporating the data into the GIS, the data were extracted for use in the ANN model using all the Culicoides capture sites as extraction points. Raster values for each capture site (or trap) were extracted for each layer and combined in a Microsoft Excel spreadsheet containing raster values for all the GIS layers per month (Figure 1).
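The one-month lag between the climate data and the counts can be sketched as a simple lookup. The dictionary layout, the key structure and the sample values are assumptions made for illustration; in the study the values were extracted from the GIS layers into a spreadsheet.

```python
def previous_month(year, month):
    """Return the (year, month) pair preceding the given month."""
    return (year - 1, 12) if month == 1 else (year, month - 1)

def lagged_records(counts, climate):
    """Pair each monthly count with the climate values of the previous month.

    counts:  {(site, year, month): combined monthly average count}
    climate: {(site, year, month): {variable name: value}}
    Counts with no climate record for the preceding month are skipped.
    """
    records = []
    for (site, year, month), count in sorted(counts.items()):
        lag_key = (site, *previous_month(year, month))
        if lag_key in climate:
            records.append(
                {"site": site, "year": year, "month": month, "count": count, **climate[lag_key]}
            )
    return records

# Hypothetical values: a January 2006 count is paired with December 2005 climate data.
counts = {("trap_A", 2006, 1): 1500.0}
climate = {("trap_A", 2005, 12): {"rain_total": 12.0, "ndvi": 0.4}}
rows = lagged_records(counts, climate)
```

The lag is the reason the data sets start in December 2005 although the counts start in January 2006.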

The raster values obtained for climate, altitude, livestock density, NDVIs and LSTs were actual numerical values measured in the field or calculated using spatial interpolation. Raster values for the buffer zones calculated for the clay areas, water bodies and cultivated fields were assigned a value of 1 or 0: a value of 1 indicated that an extraction point was located within the 2-km buffer zone calculated for a relevant feature and indicated a high probability of an abundance of Culicoides, whereas a value of 0 indicated that the extraction point was located beyond the 2-km buffer zone and therefore there was a low probability of an abundance of C. imicola and C. bolitinos. The reason for using this binary classification was that the ANN is an exact classifier. The data extracted from the various GIS layers were combined in an Excel table, which became the input for the ANN.
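The binary coding of the buffer zones can be sketched as follows. In the study the buffers were built around polygons in the GIS and converted to raster layers; the sketch below is a simplification that assumes each feature is represented by points in planar coordinates measured in kilometres, which is an assumption made purely for illustration.

```python
import math

BUFFER_KM = 2.0  # buffer distance used in the text

def within_buffer(site, feature_points, buffer_km=BUFFER_KM):
    """Return 1 if the site lies within the buffer distance of any feature point, else 0.

    site and feature_points are (x, y) tuples in kilometres (assumed planar coordinates).
    """
    return int(any(math.dist(site, point) <= buffer_km for point in feature_points))

# Hypothetical trap site and water-body locations.
near_water = within_buffer((0.0, 0.0), [(1.5, 0.0), (10.0, 10.0)])
far_from_water = within_buffer((0.0, 0.0), [(3.0, 0.0)])
```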

The combined monthly average counts of C. imicola and C. bolitinos were used as the output variable in the ANN model. A total of 99 sites were sampled by the farmers in the Western Cape at regular time intervals from January 2006 to December 2006. Of a possible 1188 (12 x 99) counts, only 337 were obtained from the traps set up by the farmers. All 337 records in the spreadsheet were checked to identify the minimum and maximum values for the variables for inclusion in the training set. The data set was then divided into training, verification and test sets. The training set consisted of 271 records (80%) selected to include all the identified minimum and maximum values of the variables. The training set was also selected to be geographically representative of the study area and included records from all four seasons for the year 2006. The training set included the verification set because the ANN software21 randomly selects a verification set from the training set during the training process. The test set consisted of the remaining 66 records (20%) which were used at a later stage to select the best ANN model for the prediction of an abundance of C. imicola and C. bolitinos.
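A split that forces the records holding each variable's minimum and maximum into the training set can be sketched as follows. The function name, the random selection of the remaining training records and the sample data are assumptions; in the study the remaining records were also chosen to be geographically and seasonally representative, which this sketch does not attempt.

```python
import random

def split_with_extremes(records, variables, train_fraction=0.8, seed=0):
    """Split records into training and test sets, keeping all extremes in the training set.

    records:   list of dicts, one per trap-month
    variables: names of the variables whose minimum and maximum must be in the training set
    """
    forced = set()
    for var in variables:
        values = [r[var] for r in records]
        forced.add(values.index(min(values)))
        forced.add(values.index(max(values)))
    n_train = max(len(forced), round(train_fraction * len(records)))
    rest = [i for i in range(len(records)) if i not in forced]
    random.Random(seed).shuffle(rest)  # assumed: remaining records are drawn at random
    n_extra = n_train - len(forced)
    train_idx = sorted(forced) + rest[:n_extra]
    test_idx = rest[n_extra:]
    return [records[i] for i in train_idx], [records[i] for i in test_idx]

# Bookkeeping from the text: possible counts, counts obtained, and the split sizes.
possible, obtained = 12 * 99, 337
missing = possible - obtained      # trap-months without a count
train_size, test_size = 271, 66    # sizes reported in the text

# Hypothetical records with two variables.
sample = [{"temp": float(i), "rain": float(9 - i)} for i in range(10)]
train, test = split_with_extremes(sample, ["temp", "rain"])
```

Keeping the extremes in the training set means the network is not asked to extrapolate beyond the range of values it was trained on when the test set is applied.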

In order to initiate the ANN training process, the training and verification sets were imported into the ANN software, which uses a feed-forward network with back propagation as a training algorithm. The training of the network started by including all the variables and with the number of epochs set at 50. The number of epochs was then increased and decreased together with changes in the momentum and learning rate until a minimum percentage misclassification on the training and verification sets was reached, at which point training was stopped. Varying these parameters was also intended to reduce the risk of the ANN settling in a local, rather than the global, minimum on the error surface. Some variables were then omitted and the process repeated. After training, those ANN models with the lowest percentage misclassified were selected and tested using the test set. The best predictive model was then selected and used to predict the abundance of C. imicola and C. bolitinos for the 851 trap points for which there were no counts for a particular month. The results from the ANN model were imported back into the GIS and a classification map of the abundance of C. imicola and C. bolitinos in the Western Cape was produced.
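The training procedure can be illustrated with a minimal feed-forward network trained by back propagation with momentum. This is a generic sketch and not the commercial ANN software used in the study: the network size, the logistic transfer function, the squared-error criterion, the parameter values and the toy data are all assumptions. Only the starting value of 50 epochs is taken from the text.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic transfer function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def train(samples, n_hidden=3, epochs=50, learning_rate=0.3, momentum=0.5, seed=1):
    """Train a one-hidden-layer network; samples is a list of (inputs, target) with target 0 or 1.

    Returns a predict function and the percentage misclassified after each epoch.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # The last weight of every neuron acts on a constant bias input of 1.0.
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    v_h = [[0.0] * (n_in + 1) for _ in range(n_hidden)]  # momentum terms, hidden layer
    v_o = [0.0] * (n_hidden + 1)                         # momentum terms, output layer

    def forward(x):
        xb = list(x) + [1.0]
        hb = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in w_h] + [1.0]
        y = sigmoid(sum(w * hi for w, hi in zip(w_o, hb)))
        return xb, hb, y

    def percent_misclassified():
        wrong = sum((forward(x)[2] >= 0.5) != bool(t) for x, t in samples)
        return 100.0 * wrong / len(samples)

    history = []
    for _ in range(epochs):
        for x, t in samples:
            xb, hb, y = forward(x)
            delta_o = (y - t) * y * (1.0 - y)
            # Hidden-layer errors are propagated back through the current output weights.
            delta_h = [delta_o * w_o[j] * hb[j] * (1.0 - hb[j]) for j in range(n_hidden)]
            for j in range(n_hidden + 1):
                v_o[j] = momentum * v_o[j] - learning_rate * delta_o * hb[j]
                w_o[j] += v_o[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    v_h[j][i] = momentum * v_h[j][i] - learning_rate * delta_h[j] * xb[i]
                    w_h[j][i] += v_h[j][i]
        history.append(percent_misclassified())
    return (lambda x: forward(x)[2]), history

# Toy data (hypothetical): the target equals the first input.
toy = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
predict, history = train(toy, epochs=50)
```

In practice, the number of epochs, the learning rate and the momentum would be varied between runs and the run with the lowest percentage misclassified retained, as described above.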

Results

The model with the best predictive capabilities was selected based on the highest percentage of correctly classified predictions. All variables that were initially included in the predictive model were found to be significant in the predictive accuracy of the abundance of C. imicola and C. bolitinos. The selected model was subsequently used to predict the abundance of C. imicola and C. bolitinos at trap points where there were no counts for particular months of the study period. The result of the prediction model was compared to the reported outbreaks of AHS and with the actual samples and counts that had been previously done.
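The selection criterion, the percentage of correctly classified predictions, can be sketched as follows. The candidate model names and the predictions are hypothetical and only illustrate the calculation.

```python
def percent_correct(predicted, observed):
    """Percentage of cases in which the predicted class equals the observed class."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    hits = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * hits / len(observed)

def best_model(candidates, observed):
    """Return the name of the candidate with the highest percentage correct.

    candidates: {model name: list of predicted classes for the test set}
    """
    return max(candidates, key=lambda name: percent_correct(candidates[name], observed))

# Hypothetical test-set classes (1 = abundant, 0 = not abundant).
observed = [1, 0, 0, 1]
candidates = {"all_variables": [1, 0, 0, 1], "ndvi_lst_only": [1, 0, 1, 1]}
selected = best_model(candidates, observed)
```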

The predicted abundance of C. imicola and C. bolitinos for January 2006 (Figure 2) for the George district coincided with an actual outbreak of AHS for the same month. The abundance of C. imicola and C. bolitinos predicted by the model for the Stellenbosch district for February 2006 (Figure 3) and March 2006 (Figure 4) coincided with actual counts. The predicted abundance for the same periods for the George district (Figures 3 and 4), and for March 2006 for the Robertson district (Figure 4) coincided with an outbreak of AHS for the same months.

There were outbreaks of AHS in the Murraysburg and Beaufort West districts (Figure 5) in April 2006, and in the Murraysburg, Beaufort West and Oudtshoorn districts in May 2006 (Figure 6). No predictions were made for the abundance of C. imicola and C. bolitinos in these districts. Possible reasons for the lack of predictions are the unequal distribution of traps in these areas and the concomitant under-representation of these districts in the ANN model. Predicted values for the months of April and May 2006 also coincided with actual counts or were located in districts where outbreaks of AHS occurred.

The model predicted a zero probability of C. imicola and C. bolitinos in abundance for June, July and October 2006 for all the districts, and there were no outbreaks of AHS recorded during this time. In August 2006 there was only one counted abundance of C. imicola and C. bolitinos, in the Robertson district, and in September 2006 only one, in the Stellenbosch district; no abundance was predicted for either month. For the experimental results it was deemed unnecessary to predict for November 2006 because most of the traps had been set up by the farmers for C. imicola and C. bolitinos counts. The counts for this particular month were therefore complete.

Evaluation of the model

The predictive model selected in the study proved to be highly accurate in predicting the abundance of C. imicola and C. bolitinos, with a predictive capability of 83%, generally corresponding to that of the GIS model developed by Wittmann et al.6 Most of the variables included in the GIS models of Baylis et al.5 and of Wittmann et al.6 to predict the probable abundance of Culicoides spp. were included in the current predictive model. In the GIS model of Baylis et al.,5 rainfall was found not to be a significant factor in predicting Culicoides spp.; whether or not rainfall was considered significant for the Wittmann et al.6 model was determined by the specific site for which abundance was predicted. The predictive model described here performed better as a predictor with rainfall included. This inclusion is supported by the research of Meiswinkel et al.,7 which showed a clear link between Culicoides abundance and above-average rainfall. The predictive model also performed better when climate anomalies, not included in the GIS models mentioned above, were included.

The predictive accuracy of the model was also improved when livestock-density and field-boundary data5,7 were included. Field boundaries, however, do not indicate the farming practices, the insecticides used or the management of animal dung, all of which can influence the abundance of Culicoides spp.5 In contrast to the findings of Baylis et al.5, the use in this study of only NDVI and LST data to simplify the model resulted in poorer predictive capabilities than when the other data sets were included. Although the final predictive model included a considerable number of variables, data for these are readily available in electronic format, except that of C. imicola and C. bolitinos counts. Counts involve costly field work, a factor which may hamper further development of the model (Venter GJ 2007, personal communication, February 15). However, once the model is fully developed to include extreme minimum and maximum climate anomalies, traps to count Culicoides spp. should become unnecessary. The potential to remove the traps has not formed part of this study and may need to be examined in the future. It is important to note that the predictive model described here does not take into account the effect of wind conditions on the abundance and distribution of Culicoides, a shortcoming too of the GIS models developed by Baylis et al.5 and Wittmann et al.6 Prevailing winds influence the number of Culicoides individuals caught in traps: in strong winds Culicoides may become stationary and fewer will be trapped. On the other hand, they can travel long distances on prevailing winds and may cause outbreaks of diseases in unsuspected areas.7 Another limitation of this predictive model is the uneven distribution of Culicoides traps. There was a high density of traps in the Stellenbosch, Paarl, Malmesbury, George and Wellington districts, but no traps in the Prince Albert, Beaufort West, Murraysburg, Vanrynsdorp, Vredendal and Clanwilliam districts. 
So, whilst the ANN model is well trained to predict C. imicola and C. bolitinos in abundance for districts where there were many traps, it is poorly trained for areas with fewer or no traps. Where actual counts of Culicoides are evenly distributed throughout a study area it may be possible to predict abundance for the whole study area and not just at specific points. The model is not trained to make predictions in certain areas because there are no actual counts that can be used to train the ANN. A final challenge with the combined use of a GIS and ANN to predict species distribution is the lack of a direct interface between the models: a high level of software knowledge and computer training is required.

Although the predicted abundance of C. imicola and C. bolitinos was compared to the actual outbreaks of AHS, it should be kept in mind that the vaccination of horses against AHS can suppress these outbreaks and may skew the correlation of AHS with C. imicola and C. bolitinos numbers. However, the positive correlation indicates that abundance for the area was predicted and an early warning system could have prevented some outbreaks.

Conclusions

In this study we successfully developed a model using GIS and ANN to predict the abundance of C. imicola and C. bolitinos in the Western Cape of South Africa. In doing so we were able to predict the abundance of species at sites where no actual counts were made. These predictions can also be used for subsequent years provided that anomalies in monthly temperatures or rainfall are minor.

The complementary use of ANN and GIS to predict the occurrence in abundance of Culicoides is encouraging. By extrapolation, these models can be used to anticipate potential outbreaks of diseases like bluetongue, epizootic haemorrhagic disease and equine encephalosis,7 which emphasises the importance of this predictive model internationally. By using this model as an early warning system to predict the abundance of these vectors, timeous action can be taken to protect animals at risk and thereby lessen the impact of the disease.19

This method furthermore demonstrates how ANN models can assist in decision-making, especially where the data sets incorporate uncertainty or where the relationships amongst the variables are unknown. The results of this study are encouraging and provide a rich set of scenarios for further research and multidisciplinary applications. Exploration of the complementary use of exact GIS models and nonparametric methods, such as ANN models, provides considerable scope for other applications and multidisciplinary research.

Acknowledgements

We thank the ARC-Onderstepoort Veterinary Institute, Agricultural Research Council for supplying the Culicoides spp. counts used in this study and the Directorate: Veterinary Services, Department of Agriculture, Forestry and Fisheries for continued support and evaluation.