Abstract

Missing data in large data analysis has affected further analysis conducted on dataset. To fill in missing data, Nearest Neighbour Method (NNM) and Expectation Maximization (EM) algorithm are the two most widely used methods. Thus, this research aims to compare both methods by imputing missing data of air quality in five monitoring stations (CA0030, CA0039, CA0042, CA0049, CA0050) in Sabah, Malaysia. PM10 (particulate matter with aerodynamic size below 10 microns) dataset in the range from 2003-2007 (Part A) and 2008-2012 (Part B) are used in this research. To make performance evaluation possible, missing data is introduced in the datasets at 5 different levels (5%, 10%, 15%, 25% and 40%). The missing data is imputed by using both NNM and EM algorithm. The performance of both data imputation methods is evaluated using performance indicators (RMSE, MAE, IOA, COD) and regression analysis. Based on performance indicators and regression analysis, NNM performs better compared to EM in imputing data for stations CA0039, CA0042 and CA0049. This may be due to air quality data missing at random (MAR). However, this is not the case for CA0050 and part B of CA0030. This may be due to fluctuation that could not be detected by NNM. Accuracy evaluation using Mean Absolute Percentage Error (MAPE) shows that NNM is more accurate imputation method for most of the cases.

Keywords:

1. INTRODUCTION

Air quality monitoring in Malaysia is continuously conducted by Department of Environment (DOE) and is done in stations around Malaysia (Dominick et al., 2012). These stations collect PM10 concentration data at one-hour interval. However, due to maintenance, calibration of monitoring instruments and power outage, data collected by monitoring stations may suffer missingness. Missing data mechanism can be categorized into three different types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR) (Nakai and Ke, 2011). Missingness is categorized as MNAR when it depends on the missing value itself. MNAR is known to be non-ignorable and missing data due to MNAR is not possible to be recovered (Graham, 2009). On the other hand, missingness due to MAR depends on the observed data. MAR is ignorable and missing data can be recovered because its missingness does not depend on missing data itself. MCAR is a special case of MAR, where missingness is independent of both missing data and observed data (Dong and Peng, 2013). A set of data containing missing data due to MCAR can be considered as complete dataset because the missingness does not introduce bias (Dong and Peng, 2013). Little’s MCAR test can be used to determine whether the missingness is due to MCAR (Li, 2013). If the missingness is not MCAR instead, this test cannot be used to determine whether the missingness is due to MAR or MNAR (Dong and Peng, 2013). In terms of air quality data in Malaysia, missingness can be considered as MAR because the missingness is mainly caused by maintenance, calibration of monitoring instruments and power outage. It does not depend on whether the value of data is lower or higher than certain value. Missingness can affect further analysis that requires complete dataset such as Fourier analysis and principal component analysis.

Particulate matter (PM) is mixture of substances in the form of small particles suspended in the air. PM is one of the critical components of air pollution (Li et al., 2017b). Due to its small size, PM can enter respiratory system, thus becoming one of major concerns in public health (Chang et al., 2018). Because of this, scientific attraction has been attracted towards PM (Shahraiyni and Sodoudi, 2016). PM mainly comes from motor vehicles, dust from construction sites and landfills. It also comes from biomass burning and brought by haze, a typical challenge in Southeast Asia since 1980s (Shaadan et al., 2015). PM10 (particulate matter with aerodynamic diameter less than 10 microns) is one of major concern because it possesses hazardous properties towards human health compared to other pollutants such as carbon monoxide and nitrogen dioxide (Kim et al., 2015; Ny and Lee, 2010). This is because it can enter respiratory system while defending natural defences of human body (Chang et al., 2018). PM10 can increase risk of asthma, aggravate bronchitis, respiratory syncytial virus (RSV) bronchiolitis and other lung diseases (Carugno et al., 2018; Lelieveld et al., 2015). This is especially true for children aged between 5-15 years (Cadelis et al., 2014). Other than respiratory problems, cardiovascular disease and cancer can be developed due to PM10 in the air (Li et al., 2017a).

Many agencies around the world such as European Union (EU) and World Health Organization (WHO) implemented guidelines and set limit on air pollution concentration levels (Abd. Rani et al., 2018). In Malaysia, the guidelines are implemented by DOE. According to New Malaysia Ambient Air Quality Standard, PM10 concentration has its standard set to 50 μg/m3 (1-year averaging time) on 2015 before it is gradually lowered to 40 μg/m3 by 2020 (Department of Environment, n. d.). The implementation of this standard is important in order to ensure that air quality can be maintained at safe level. Therefore, there is a need to continuously monitor ambient air quality around Malaysia.

This research focuses on evaluating performance of data imputation on air quality data from five monitoring stations around Sabah. To make performance evaluation possible, missingness is introduced to compare observed data with imputed data. Two methods of data imputation are studied in this research, namely Nearest Neighbour Method (NNM) and Expectation-Maximization (EM) algorithm. Many previous studies have employed nearest neighbour method and expectation-maximization algorithm to obtain complete dataset. However, not many of these studies emphasize on the efficiency of these two methods in data imputation. By comparing between both NNM and EM algorithm, further analysis that requires complete dataset can be made more accurate.

2. DATA AND METHODS

2. 1 Study Area and Data

Five monitoring stations (CA0030, CA0039, CA0042, CA0049, CA0050) in Sabah are listed in Table 1. Respective cities of each monitoring station are located as shown in Fig. 1. Except for CA0049, other monitoring stations are located at low altitudes and are close to the sea. Furthermore, Labuan (CA0050) is situated on a small island located at western of Sabah. As shown in Fig. 2, PM10 concentration in Sabah differs between seasons and location (Kanniah et al., 2016). Western coast of Sabah generally has higher PM10 concentration compared to other parts of Sabah all-year round. Also, PM10 concentration in Sabah is generally lower during intermonsoon October.

For the purpose of this research, 10-year hourly data from 2003 to 2012 are divided into two parts. The first part (Part A) ranges from 2003 to 2007, while the second part (Part B) ranges from 2008 to 2012. Due to climate change, trends of PM10 concentration data may differ from both parts. Thus, both parts may have difference in these data.

2. 2 Introduce Missingness to Data

In order to ensure that imputed data can be validated, a fraction of observed data must be replaced by missingness. Depending on complexity, missingness is introduced into data by percentage as conducted in previous research by Noor et al. (2014) as shown in Table 2. A sequence of zeros and ones (0 - do not replace observed data, 1 - replace observed data with missingness) is randomly generated using MATLAB 2018b and is used as a reference to introduce missingness to observed data. The actual percentage after introducing missingness may deviate by up to 2% due to existing missingness in the data.

EM algorithm employs a set of iterative equations to estimate mean vector and covariance matrix of multivariate distribution from exponential family (Junger and de Leon, 2015). This method maximizes log likelihood to find parameters when there are missing values (Nakai and Ke, 2011). The simplicity and smooth operation of EM algorithm makes it unique among present multiple imputation methods. In addition, its faster operation compared to the alternatives makes EM algorithm one of the most popular imputation methods (Abd Rani et al., 2018).

Given a set of data consisting of observed data Dobs and missing data Dmis, EM algorithm starts by defining parameter θ as a random value. Then, E-step (expectation step) calculates the likelihood of each values of Dmis for every missingness. M-step (maximization step) uses computed values of Dmis to find better estimation of θ. Given the likelihood function L and expected value of log likelihood function Q (θ|θ(t)), both E-step and M-step iterate until the value converges (Abd Rani et al., 2018). Both E-step and M-step are executed using equations (5) and (6).

where n is total number of data, Pi is predicted value of ith data, Oi is observed value of ith data, P¯ is mean predicted value, O¯ is mean observed value, sp is standard deviation of predicted values, and so is standard deviation of observed values.

3. RESULT AND DISCUSSION

3. 1 Performance Indicators

PM10 concentration datasets for five monitoring stations in Sabah are analysed. RMSE, MAE, IOA, and COD are calculated for every percentage of missingness and station for both part A and B. Tables 3 and 4 reveals performance indicators for NNM and EM at 5 missingness levels and 5 different stations for part A and part B respectively. The desirable attributes between these methods are highlighted in bold. In terms of missingness level, there is no definite relationship between performance of data imputation and missingness level. This is because both NNM and EM impute missing data based on available data. As long as available data is sufficient, missing data can still be effectively imputed.

Performance indicators for every station and missingness percentage for part B.

Most of the data show that nearest neighbour method is better imputation method. This may be due to the nature of missingness in relation to ability of EM algorithm to impute data. EM algorithm works best for missing data caused by MCAR (Nakai and Ke, 2011; Graham, 2009). However, air quality data collected in monitoring stations are not caused by MCAR as the cause of missingness is known. This may attribute to lower performance of EM algorithm compared to NNM.

However, this is not the case for CA0050, where most of the performance indicators for that station show that EM algorithm is a better imputation method. This may be due to the fact that Labuan is surrounded by sea. One study has shown that air humidity is affected by bodies of water due to high heat capacity and strong evaporation (Zhu and Zeng, 2018). Furthermore, cold-wet air that surrounds a water body enhances air flow away from bodies of water by changing the local air circulation (Zhu and Zeng, 2018). The local air circulation highly affects humidity in Labuan. Another study suggests that different levels of humidity affects PM10 concentration differently (Lou et al., 2017). PM10 concentration increases with humidity up to 60%. Beyond that point, gravity deposition occurs and PM10 concentration begins to drop (Lou et al., 2017). PM10 concentration as monitored by CA0050 may fluctuate due to continually changing of humidity level, traffic congestion and active industrial activity. This fluctuation is not accounted by NNM, leading to indication that EM algorithm is better imputation method for data collected by CA0050.

As for PM10 concentration read by CA0030, several performance indicators show that EM algorithm is better imputation method especially for part B of the data. This may be due to fluctuation of PM10 concentration in Kota Kinabalu especially between year 2008 and 2012. One study shows that PM10 concentration from 16th to 18th January 2012 spiked at 7.00 a.m. and fluctuates at the other time (Chang et al., 2018). When this portion of data is missing, NNM may not be able to restore the missingness as well as EM algorithm.

3. 2 Regression Analysis on Imputed Data

The performance of data imputation is further evaluated by calculating correlation of coefficient R on predicted data against observed data. The most ideal case of imputed data occurs when predicted data equals observed data (R=1). Tables 5 and 6 reveals coefficient of correlation of data in part A and B respectively, for all five missingness percentages and five stations.

Similar to performance indicators, coefficient of correlation shows that NNM is better imputation method for monitoring stations in Tawau, Sandakan and Keningau. As for CA0030, NNM is better imputation method for Part A, but not in Part B. Dataset recorded by CA0050 strongly suggests that EM algorithm is better imputation method.

Fig. 3 reveals scatter plot of data imputation for both CA0042 and CA0050. CA0042 and CA0050 are selected to be presented in the Fig. 3 because CA0042 is located at high altitude while CA0050 is located in a small island. The predicted-observed regression is shown for both stations due to different geographical condition in contrast to the other three stations. Coefficient of correlation for CA0042 shows relatively large difference between two methods compared to other stations. As shown in Fig. 3, all scatter plots for CA0042 shows that line representing NNM is closer to dashed line compared to line that represents EM algorithm. This shows that NNM has greater tendency to predict missing data closer to observed data compared to EM algorithm. This might be caused by missingness mechanism, in which data is Missing at Random. EM algorithm may not be able to impute MAR data as well as MCAR data (Nakai and Ke, 2011; Graham, 2009).

Scatter plot for imputation of data from CA0042 and CA0050 for part A and B at various missingness percentage. Blue indicates NNM, red indicates EM, while dashed line represents the point where predicted data equals observed data.

Meanwhile, CA0050 shows that EM algorithm gives better coefficient of correlation in contrast to other stations. Despite that, Fig. 3 reveals that NNM has either greater tendency (Part A) or approximately similar to EM algorithm (Part B) to predict missing data. This is because the lines representing NNM and EM are plotted at best fit. However, the scatter plot shows that imputed data by NNM for CA0050 are more dispersed away from line of best fit compared to that of CA0042, which might contribute to lower R value of NNM compared to EM algorithm. Although best fit line for NNM is closer to dashed line, the dispersion of scatter plot shows that EM algorithm is better imputation method compared to NNM.

3. 3 Mean Absolute Percentage Error (MAPE)

Performance of data imputation is further evaluated using MAPE. Data imputation is most accurate when MAPE approaches zero. Table 7 reveals accuracy of data imputation using NNM and EM for all stations and various level of missingness. According to Table 7, it is shown that NNM is generally more accurate data imputation method compared to EM (except for CA0050 in set B). This is reflected by lower values for NNM for most of the cases. This may be due to its ability to predict missing data closer to actual data compared to EM.

Mean absolute percentage error (MAPE) of stations in Sabah for various missingness level.

4. CONCLUSION

Generally, it has been shown that NNM is better imputation method for data from all the monitoring stations in Sabah except CA0050. NNM works most efficient for CA0049 in Part A (RMSE<14.302, MAE<10.640, IA>0.819 and COD>0.586) and CA0042 in Part B (RMSE<10.722, MAE<7.526, IA>0.835 and COD>0.632). This may be due to missing data type of MAR. However, strong fluctuation which may be present in data from CA0050 and part B from CA0030 may cause NNM to impute data not as well as EM algorithm. This may be further confirmed by regression analysis for CA0050 (R>0.711 for part B). Evaluation of accuracy using MAPE reveals that NNM is more accurate imputation method for most cases (except for set B in CA0050). This shows that NNM can be used as data imputation for missing data found in dataset observed by stations in Sabah. Accurate data imputation is important for future research because this enables further analysis on air quality data to become more reliable.

Acknowledgments

The authors would like to thank Universiti Malaysia Sabah for supporting this research by providing grant (SBK0324-2018, SGI0054-2018 and GUG0378-2018) and Department of Environment Malaysia for providing meteorological and pollutant data for research purpose.

Cadelis, G., Tourres, R., Molinie, J. (2014) Short-Term Effects of the Particulate Pollutants Contained in Saharan Dust on the Visits of Children to the Emergency Department due to Asthmatic Conditions in Guadeloupe (French Archipelago of the Caribbean). PLOS ONE, 9(3), 1-11..
[https://doi.org/10.1371/journal.pone.0091136]

Li, L., Liu, D.J. (2014) Study on an Air Quality Evaluation Model for Beijing City Under Haze-Fog Pollution Based on New Ambient Air Quality Standards. International Joutnal of Environment Research and Public Health, 11, 8909-8923..
[https://doi.org/10.3390/ijerph110908909]

Zhu, C., Zeng, Y. (2018) Effects of urban lake wetlands on the spatial and temporal distribution of air PM10 and PM2.5 in the spring in Wuhan. Urban Forestry and Urban Greening, 31, 142-156..
[https://doi.org/10.1016/j.ufug.2018.02.008]

Fig. 3.

Scatter plot for imputation of data from CA0042 and CA0050 for part A and B at various missingness percentage. Blue indicates NNM, red indicates EM, while dashed line represents the point where predicted data equals observed data.

Table 1.

Location of monitoring stations in Sabah.

Station ID

Station name

Latitude

Longitude

Altitude (m)

CA0030

SM Putatan, Kota Kinabalu

5.9804° N

116.0735° E

13

CA0039

Pejabat JKR Tawau, Tawau

4.2447° N

117.8912° E

12

CA0042

Pejabat JKR Sandakan, Sandakan

5.8394° N

118.1172° E

10

CA0049

SMK Gunsanad, Keningau

5.3374° N

116.1567° E

288

CA0050

Taman Perumahan MPL, Labuan

5.3441° N

115.2404° E

13

Table 2.

Percentage of missingness as conducted by Noor et al. (2014).

Degree of complexity

Percentage of missingness (%)

Small

5

10

Medium

15

25

Large

40

Table 3.

Performance indicators for every station and missingness percentage for part A.

Station

Missing-ness
(%)

Performance indicators

RMSE

MAE

IOA

COD

NNM

EM

NNM

EM

NNM

EM

NNM

EM

Remark: Data ranges from year 2003 to 2007

CA0030

5

18.130

17.161

11.765

11.699

0.760

0.649

0.495

0.509

10

17.022

16.864

11.279

11.728

0.782

0.646

0.536

0.505

15

16.887

16.672

11.422

11.715

0.784

0.651

0.540

0.503

25

16.862

16.205

11.496

11.488

0.782

0.660

0.538

0.505

40

17.362

16.412

11.745

11.504

0.769

0.660

0.521

0.506

CA0039

5

21.701

23.367

13.371

15.648

0.787

0.582

0.522

0.517

10

21.000

23.001

13.213

15.517

0.783

0.566

0.527

0.484

15

21.591

22.702

13.640

15.389

0.770

0.572

0.504

0.487

25

21.526

22.648

13.756

15.376

0.776

0.569

0.526

0.497

40

22.654

22.742

14.220

15.406

0.750

0.569

0.488

0.485

CA0042

5

15.712

17.525

10.841

11.761

0.804

0.511

0.595

0.370

10

15.393

16.923

10.609

11.637

0.801

0.530

0.586

0.397

15

15.229

16.543

10.588

11.574

0.795

0.544

0.577

0.407

25

14.930

16.151

10.506

11.510

0.798

0.556

0.585

0.428

40

15.328

16.313

10.824

11.656

0.792

0.553

0.574

0.425

CA0049

5

13.560

14.797

13.371

10.451

0.841

0.645

0.630

0.566

10

13.861

15.218

13.213

10.609

0.835

0.626

0.611

0.534

15

13.707

15.099

13.640

10.639

0.834

0.619

0.613

0.520

25

13.883

15.305

13.756

10.640

0.830

0.610

0.599

0.514

40

14.302

15.163

14.220

10.597

0.819

0.619

0.586

0.518

CA0050

5

14.937

13.258

10.666

10.095

0.719

0.676

0.482

0.473

10

15.685

13.314

10.856

10.093

0.702

0.665

0.451

0.467

15

15.552

13.161

10.818

9.953

0.698

0.664

0.453

0.464

25

15.251

13.266

10.752

9.903

0.711

0.663

0.469

0.471

40

15.239

13.351

10.843

9.957

0.707

0.661

0.468

0.467

Table 4.

Performance indicators for every station and missingness percentage for part B.

Station

Missing-ness
(%)

Performance index

RMSE

MAE

IOA

COD

NNM

EM

NNM

EM

NNM

EM

NNM

EM

Remark: Data ranges from year 2008 to 2012

CA0030

5

16.676

14.553

12.117

10.864

0.734

0.698

0.506

0.521

10

16.299

14.486

12.003

10.864

0.751

0.703

0.526

0.533

15

16.578

14.595

12.075

10.933

0.743

0.703

0.512

0.528

25

16.971

14.660

12.247

10.975

0.726

0.695

0.495

0.519

40

17.758

14.988

12.506

11.055

0.706

0.688

0.461

0.508

CA0039

5

13.996

18.965

9.867

15.145

0.860

0.666

0.661

0.535

10

14.383

18.737

9.977

15.062

0.846

0.672

0.629

0.531

15

14.365

18.913

9.994

15.164

0.849

0.668

0.647

0.528

25

14.602

19.104

10.187

15.258

0.852

0.669

0.654

0.522

40

15.484

18.984

10.753

15.246

0.830

0.673

0.632

0.530

CA0042

5

10.615

12.535

7.450

9.652

0.845

0.616

0.640

0.476

10

10.308

12.533

7.260

9.675

0.847

0.606

0.642

0.466

15

10.430

12.699

7.287

9.712

0.851

0.602

0.647

0.461

25

10.505

12.602

7.368

9.729

0.847

0.602

0.644

0.469

40

10.722

12.809

7.526

9.770

0.835

0.591

0.632

0.451

CA0049

5

18.255

17.987

9.867

12.430

0.756

0.529

0.457

0.400

10

16.754

17.404

9.977

12.274

0.780

0.551

0.492

0.428

15

16.966

18.046

9.994

12.457

0.777

0.530

0.486

0.399

25

16.771

18.225

10.187

12.574

0.780

0.521

0.495

0.394

40

17.564

18.143

10.753

12.629

0.758

0.523

0.462

0.396

CA0050

5

23.646

16.463

14.435

11.360

0.693

0.795

0.405

0.609

10

23.071

16.22

14.070

11.317

0.701

0.798

0.427

0.616

15

23.210

16.007

14.114

11.272

0.695

0.809

0.418

0.633

25

23.163

16.279

14.173

11.278

0.696

0.800

0.414

0.622

40

22.865

16.203

14.213

11.319

0.698

0.800

0.424

0.625

Table 5.

Coefficient of correlation for dataset in Part A.

Missingness
(%)

Station

CA0030

CA0039

CA0042

CA0049

CA0050

NNM

EM

NNM

EM

NNM

EM

NNM

EM

NNM

EM

Remark: Data ranges from year 2003 to 2007

5

0.593

0.569

0.633

0.575

0.653

0.406

0.714

0.604

0.524

0.363

10

0.624

0.547

0.626

0.535

0.648

0.429

0.707

0.580

0.504

0.507

15

0.626

0.552

0.606

0.539

0.639

0.438

0.704

0.561

0.497

0.504

25

0.622

0.555

0.613

0.541

0.644

0.456

0.698

0.553

0.513

0.514

40

0.602

0.560

0.575

0.537

0.634

0.449

0.680

0.558

0.506

0.512

Table 6.

Coefficient of correlation for dataset in Part B.

Missingness
(%)

Station

CA0030

CA0039

CA0042

CA0049

CA0050

NNM

EM

NNM

EM

NNM

EM

NNM

EM

NNM

EM

Remark: Data ranges from year 2008 to 2012

5

0.544

0.577

0.746

0.558

0.721

0.493

0.594

0.386

0.498

0.711

10

0.570

0.587

0.723

0.566

0.725

0.487

0.628

0.412

0.506

0.716

15

0.560

0.586

0.728

0.560

0.732

0.485

0.622

0.386

0.498

0.729

25

0.533

0.571

0.732

0.565

0.724

0.490

0.625

0.369

0.500

0.723

40

0.506

0.564

0.695

0.571

0.704

0.473

0.594

0.363

0.501

0.723

Table 7.

Mean absolute percentage error (MAPE) of stations in Sabah for various missingness level.