Sains Malaysiana 44(3)(2015): 449–456
A Comparison of Various Imputation Methods for Missing Values in Air Quality Data
(Perbandingan Pelbagai Kaedah Imputasi bagi Data Lenyap untuk Data Kualiti Udara)
NURYAZMIN AHMAT ZAINURI1*, ABDUL AZIZ JEMAIN2 & NORA MUDA2
1Fundamental Studies of Engineering Unit, Faculty of Engineering and Built Environment
Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor Darul Ehsan, Malaysia
2School of Mathematical Sciences, Faculty of Science and Technology
Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor Darul Ehsan, Malaysia
Received: 30 May 2013/Accepted: 21 August 2014
ABSTRACT
This paper compares imputation methods for missing values in air quality data from Malaysia. The main objectives were to select the best imputation method and to determine whether the best-performing methods differ between monitoring stations in Peninsular Malaysia. Missing data were simulated at random with 5, 10, 15, 20, 25 and 30% of values missing. Six methods were compared: mean substitution, median substitution, the expectation-maximization (EM) method, singular value decomposition (SVD), the K-nearest neighbour (KNN) method and the sequential K-nearest neighbour (SKNN) method. Imputation performance was compared using three performance indicators: the correlation coefficient (R), the index of agreement (d) and the mean absolute error (MAE). Based on the results obtained, EM, KNN and SKNN are the three best methods. The same result was obtained for all eight monitoring stations used in this study.
Keywords: Imputation techniques; missing data; performance
indicators
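The evaluation procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a synthetic series in place of the station data, covers only the mean and median substitution methods, uses the common Willmott form of the index of agreement, and computes the indicators over the full series rather than only over the masked points. All function names (`mask_at_random`, `impute_constant`, `performance`, `demo`) are hypothetical.

```python
import numpy as np


def mask_at_random(x, frac, rng):
    """Return a copy of x with a fraction `frac` of entries set to NaN, and the mask."""
    x = np.asarray(x, dtype=float)
    n_missing = int(round(frac * x.size))
    idx = rng.choice(x.size, size=n_missing, replace=False)
    mask = np.zeros(x.size, dtype=bool)
    mask[idx] = True
    x_missing = x.copy()
    x_missing[mask] = np.nan
    return x_missing, mask


def impute_constant(x_missing, how="mean"):
    """Fill NaN entries with the mean or median of the observed entries."""
    if how == "mean":
        stat = np.nanmean(x_missing)
    else:
        stat = np.nanmedian(x_missing)
    return np.where(np.isnan(x_missing), stat, x_missing)


def performance(observed, predicted):
    """Correlation coefficient R, index of agreement d (Willmott form, assumed), and MAE."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    mae = np.mean(np.abs(p - o))
    r = np.corrcoef(o, p)[0, 1]
    o_bar = o.mean()
    d = 1.0 - np.sum((p - o) ** 2) / np.sum(
        (np.abs(p - o_bar) + np.abs(o - o_bar)) ** 2
    )
    return {"R": r, "d": d, "MAE": mae}


def demo():
    rng = np.random.default_rng(0)
    # Synthetic stand-in for a pollutant series (assumption, not the study data).
    t = np.arange(200)
    x = 50 + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 2, t.size)
    for frac in (0.05, 0.10, 0.15, 0.20, 0.25, 0.30):
        x_missing, _ = mask_at_random(x, frac, rng)
        for how in ("mean", "median"):
            scores = performance(x, impute_constant(x_missing, how))
            print(f"{frac:.0%} missing, {how}: "
                  + ", ".join(f"{k}={v:.3f}" for k, v in scores.items()))


if __name__ == "__main__":
    demo()
```

Higher R and d and lower MAE indicate a better imputation. The other four methods (EM, SVD, KNN and SKNN) would slot in as alternatives to `impute_constant` under the same evaluation loop.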
ABSTRAK
Kertas ini membincangkan pelbagai kaedah
imputasi bagi rawatan data lenyap untuk data kualiti udara khususnya di
Malaysia. Objektif utama kajian ini
ialah memilih rawatan data lenyap yang terbaik dan juga perbandingan sama ada wujud perbezaan antara kaedah yang digunakan antara
stesen di Semenanjung Malaysia. Pelbagai kes data lenyap
telah dijana secara rawak iaitu dengan 5, 10, 15, 20, 25 dan 30% data lenyap. Enam kaedah rawatan data lenyap telah digunakan dalam kajian
ini iaitu teknik berasaskan min, median, jangkaan pemaksimuman (EM),
dekomposisi nilai tunggal (SVD), K-jiran terdekat (KNN)
dan K-jujukan jiran terdekat (SKNN). Pemilihan teknik imputasi terbaik adalah berdasarkan kepada
penunjuk prestasi yang menggunakan nilai pekali korelasi (R), indeks
persetujuan (d) dan min ralat mutlak (MAE). Berdasarkan
kepada keputusan yang diperoleh, dapat disimpulkan bahawa kaedah EM, KNN dan SKNN adalah tiga kaedah yang terbaik. Keputusan yang sama diperoleh bagi semua stesen yang
digunakan dalam kajian ini.
Kata kunci: Data lenyap; penunjuk prestasi;
teknik imputasi
*Corresponding author; email: yazmin@eng.ukm.my