Sains Malaysiana 44(10)(2015): 1531–1540
Application of Functional Data Analysis for the Treatment of Missing Air Quality Data
(Aplikasi Analisis Data Fungsian untuk Merawat Data Kualiti Udara yang Lenyap)
NORSHAHIDA SHAADAN1*, SAYANG MOHD DENI1 & ABDUL AZIZ JEMAIN2
1Center for Statistical and Decision Science Studies, Faculty of Computer & Mathematical Sciences, Universiti Teknologi MARA (UiTM), 40450 Shah Alam, Selangor Darul Ehsan, Malaysia
2DELTA, School of Mathematical Sciences, Faculty of Science & Technology, Universiti Kebangsaan Malaysia (UKM), 43600 Bangi, Selangor Darul Ehsan, Malaysia
Received: 26 March 2014/Accepted: 15 June 2015
ABSTRACT
In most research, including environmental research, recorded data are often
missing, and this has become a common data quality problem. In this study,
several imputation methods based on functional data analysis techniques are
introduced, and their capability for estimating missing values is
investigated. Single imputation methods and iterative imputation methods are
carried out by means of curve estimation using regression and roughness
penalty smoothing approaches. The performance of the methods is compared using
a reference data set, the real PM10 data from an air quality monitoring
station, namely the Petaling Jaya station, located in the western part of
Peninsular Malaysia. One hundred missing data sets are generated from the
reference data set for each of six different patterns of missing values. The
patterns are simulated according to three percentages of missing values (5, 10
and 15) and two maximum gap lengths (3 and 7 consecutive missing points).
Using the mean absolute error, the index of agreement and the coefficient of
determination as performance indicators, the results show that the iterative
imputation method using the roughness penalty approach is more flexible than,
and superior to, the other methods.
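The idea of iterative imputation by roughness penalty smoothing can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in a discrete analogue (a second-difference penalty on the curve values, with no basis expansion), and the function names, the smoothing parameter `lam`, the mean starting value and the synthetic 24-point curve are all assumptions made for illustration.

```python
import math

def second_difference_penalty(n):
    """Return D^T D (n x n), where D takes second differences of a length-n vector."""
    P = [[0.0] * n for _ in range(n)]
    for r in range(n - 2):
        idx, coef = (r, r + 1, r + 2), (1.0, -2.0, 1.0)
        for i, ci in zip(idx, coef):
            for j, cj in zip(idx, coef):
                P[i][j] += ci * cj
    return P

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def smooth(y, lam):
    """Minimise sum (y_i - f_i)^2 + lam * sum (second difference of f)^2."""
    n = len(y)
    P = second_difference_penalty(n)
    A = [[lam * P[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(A, y)

def iterative_impute(y, lam=1.0, tol=1e-6, max_iter=500):
    """Fill None entries by repeatedly smoothing and re-imputing the gaps."""
    missing = [i for i, v in enumerate(y) if v is None]
    observed = [v for v in y if v is not None]
    start = sum(observed) / len(observed)          # assumed starting guess
    filled = [start if v is None else v for v in y]
    for _ in range(max_iter):
        fit = smooth(filled, lam)
        change = max((abs(fit[i] - filled[i]) for i in missing), default=0.0)
        for i in missing:                          # observed values stay fixed
            filled[i] = fit[i]
        if change < tol:
            break
    return filled

# Synthetic example (made-up data, not the PM10 series): a smooth daily cycle
# with a gap of three consecutive missing points.
truth = [50 + 20 * math.sin(2 * math.pi * t / 24) for t in range(24)]
gap = (9, 10, 11)
y = [None if t in gap else v for t, v in enumerate(truth)]
filled = iterative_impute(y, lam=1.0)
```

Only the missing positions are updated at each pass, so the observed values are returned unchanged and the imputed values settle where the smoothed curve and the filled series agree.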
Keywords: Air quality; functional data; imputation; missing value; PM10
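The six missing-value patterns described in the abstract (three percentages crossed with two maximum gap lengths, one hundred data sets each) could be generated along the following lines. The abstract does not state the authors' deletion algorithm, so the rejection scheme below, the function names, the series length of 240 and the random seed are all hypothetical.

```python
import random

def simulate_missing(n, pct, max_gap, rng):
    """Return sorted indices to delete: pct% of n points, in runs of at most max_gap."""
    target = round(n * pct / 100)
    missing = set()
    attempts = 0
    while len(missing) < target and attempts < 100_000:
        attempts += 1
        length = min(rng.randint(1, max_gap), target - len(missing))
        start = rng.randrange(0, n - length + 1)
        # Reject blocks that overlap or touch an existing gap, so that two gaps
        # never merge into a run longer than max_gap.
        if any(i in missing for i in range(start - 1, start + length + 1)):
            continue
        missing.update(range(start, start + length))
    return sorted(missing)

def longest_run(indices):
    """Length of the longest run of consecutive integers in a sorted list."""
    best = run = 0
    prev = None
    for i in indices:
        run = run + 1 if prev is not None and i == prev + 1 else 1
        best = max(best, run)
        prev = i
    return best

rng = random.Random(2015)   # fixed seed so the sketch is reproducible
n_points = 240              # assumed series length, for illustration only
patterns = {
    (pct, gap): [simulate_missing(n_points, pct, gap, rng) for _ in range(100)]
    for pct in (5, 10, 15)
    for gap in (3, 7)
}
```

Each key of `patterns` is one (percentage, maximum gap length) combination, and each value holds one hundred index sets that can be blanked out of a complete reference series before imputation.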
ABSTRAK (English translation)
In most research, including environmental research, missing data often occur
in the records and have become a common data quality problem. In this study,
several imputation methods based on functional data analysis techniques are
proposed and their capability is examined. Single imputation and iterative
imputation methods are carried out through a curve estimation approach using
regression smoothing and roughness penalty techniques. The performance of the
imputation methods is compared using a reference data set of actual
observations of the PM10 pollutant recorded at the Petaling Jaya air quality
monitoring station, located in the western part of Peninsular Malaysia. To
examine the performance of the proposed imputation methods, one hundred data
sets are generated from the reference data for each of six different
missing-data patterns. The missing-data patterns are simulated according to
three percentages of missing values (5, 10 and 15) with two maximum gap
lengths (3 and 7 consecutive missing points). With the mean absolute error,
the index of agreement and the coefficient of determination as performance
indicators, the analysis finds that the iterative imputation method using the
roughness penalty approach is more flexible and better than the other methods.
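The three performance indicators named in the abstract can be written down compactly. The sketch below uses Willmott's form of the index of agreement and takes the coefficient of determination to be the squared Pearson correlation between true and imputed values; the latter is an assumption, since the abstract does not give the formula used. The function names and the four-point example are illustrative only.

```python
def mean_absolute_error(obs, pred):
    """Average absolute difference between true and imputed values."""
    return sum(abs(p - o) for o, p in zip(obs, pred)) / len(obs)

def index_of_agreement(obs, pred):
    """Willmott's d = 1 - sum (P - O)^2 / sum (|P - Obar| + |O - Obar|)^2."""
    mean_obs = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2
              for o, p in zip(obs, pred))
    return 1.0 - num / den

def coefficient_of_determination(obs, pred):
    """Squared Pearson correlation (assumed definition of R^2 for this sketch)."""
    mean_obs = sum(obs) / len(obs)
    mean_pred = sum(pred) / len(pred)
    sxy = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    sxx = sum((o - mean_obs) ** 2 for o in obs)
    syy = sum((p - mean_pred) ** 2 for p in pred)
    return sxy ** 2 / (sxx * syy)

# Made-up values standing in for the true and imputed readings at deleted points.
true_values = [10.0, 20.0, 30.0, 40.0]
imputed_values = [12.0, 18.0, 33.0, 39.0]

mae = mean_absolute_error(true_values, imputed_values)
d = index_of_agreement(true_values, imputed_values)
r2 = coefficient_of_determination(true_values, imputed_values)
```

A smaller mean absolute error, and values of the index of agreement and the coefficient of determination closer to 1, indicate imputed values that track the deleted observations more closely.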
Keywords: Functional data; imputation; air quality; missing value; PM10
*Corresponding author; email: shahida@tmsk.uitm.edu.my