Sains Malaysiana 44(10)(2015): 1531–1540

 

Application of Functional Data Analysis for the Treatment of Missing Air Quality Data

(Aplikasi Analisis Data Fungsian untuk Merawat Data Kualiti Udara yang Lenyap)

 

NORSHAHIDA SHAADAN1*, SAYANG MOHD DENI1 & ABDUL AZIZ JEMAIN2

 

1Center for Statistical and Decision Science Studies, Faculty of Computer & Mathematical Sciences, Universiti Teknologi MARA (UiTM), 40450 Shah Alam, Selangor Darul Ehsan, Malaysia

 

2DELTA, School of Mathematical Sciences, Faculty of Science & Technology, Universiti Kebangsaan Malaysia (UKM), 43600 Bangi, Selangor Darul Ehsan, Malaysia

 

Received: 26 March 2014/Accepted: 15 June 2015

 

ABSTRACT

Missing records are a common data quality problem in most fields of research, including environmental research. In this study, several imputation methods based on functional data analysis techniques are introduced, and their capability for estimating missing values is investigated. Single imputation and iterative imputation are carried out by means of curve estimation using regression smoothing and roughness penalty smoothing approaches. The performance of the methods is compared using a reference data set of real PM10 observations from an air quality monitoring station, the Petaling Jaya station, located in the western part of Peninsular Malaysia. To investigate performance, one hundred data sets with missing values were generated from the reference data set for each of six missing-value patterns. The patterns were simulated according to three percentages of missing values (5, 10 and 15%) combined with two maximum gap lengths (3 and 7 consecutive missing points). Using the mean absolute error, the index of agreement and the coefficient of determination as performance indicators, the results showed that the iterative imputation method using the roughness penalty approach is more flexible than, and superior to, the other methods.
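The evaluation design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the synthetic series and the smoothing parameter `lam` are hypothetical. The imputation step uses a discrete second-difference penalty (a Whittaker-type smoother) as a simple stand-in for roughness penalty smoothing on a spline basis, and the coefficient of determination is assumed here to be the squared Pearson correlation.

```python
import numpy as np

def simulate_missing(n, pct, max_gap, rng):
    """Boolean mask with pct% of n points missing, in runs of at most max_gap.
    Assumes pct is small enough for the runs to fit (true for 5, 10 and 15)."""
    target = int(round(n * pct / 100))
    mask = np.zeros(n, dtype=bool)
    while mask.sum() < target:
        remaining = target - int(mask.sum())
        gap = int(rng.integers(1, min(max_gap, remaining) + 1))
        start = int(rng.integers(0, n - gap + 1))
        lo, hi = max(start - 1, 0), min(start + gap + 1, n)
        # Keep runs separated by an observed point so no gap exceeds max_gap.
        if not mask[lo:hi].any():
            mask[start:start + gap] = True
    return mask

def penalized_impute(y, mask, lam=10.0):
    """Fill masked points from a curve f minimizing
    sum over observed (y - f)^2 + lam * sum (second differences of f)^2."""
    n = y.size
    w = (~mask).astype(float)                # weight 0 at missing points
    D = np.diff(np.eye(n), n=2, axis=0)      # second-difference operator
    A = np.diag(w) + lam * D.T @ D
    b = w * np.where(mask, 0.0, y)           # missing entries never enter the fit
    fitted = np.linalg.solve(A, b)
    out = y.copy()
    out[mask] = fitted[mask]                 # observed values are kept as recorded
    return out

def mae(obs, pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(pred - obs)))

def index_of_agreement(obs, pred):
    """Index of agreement d in [0, 1]; 1 means perfect agreement."""
    obar = obs.mean()
    denom = np.sum((np.abs(pred - obar) + np.abs(obs - obar)) ** 2)
    return float(1.0 - np.sum((pred - obs) ** 2) / denom)

def r_squared(obs, pred):
    """Coefficient of determination, taken as squared Pearson correlation."""
    return float(np.corrcoef(obs, pred)[0, 1] ** 2)

# Synthetic hourly-like series (NOT real PM10 data), 10% missing, gaps up to 7.
rng = np.random.default_rng(0)
t = np.arange(240)
truth = 50 + 15 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 3, t.size)
mask = simulate_missing(t.size, pct=10, max_gap=7, rng=rng)
imputed = penalized_impute(truth, mask, lam=10.0)
print("MAE:", mae(truth[mask], imputed[mask]))
print("d:  ", index_of_agreement(truth[mask], imputed[mask]))
print("R2: ", r_squared(truth[mask], imputed[mask]))
```

Repeating the last block over one hundred random masks for each combination of percentage (5, 10, 15) and maximum gap length (3, 7), and averaging the three indicators, would mirror the comparison outlined above. The indicators are computed only at the artificially removed points, where the reference values are known.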

 

Keywords: Air quality; functional data; imputation; missing value; PM10

 

 



 

 

*Corresponding author; email: shahida@tmsk.uitm.edu.my

 

 
