Sains Malaysiana 43(10)(2014): 1599–1607
Imputing Missing Values in Modelling the PM10 Concentrations
(Mengganti Nilai Hilang dalam Pemodelan Kepekatan PM10)
NURADHIATHY ABD RAZAK1, YONG ZULINA ZUBAIRI2* & ROSSITA M. YUNUS3
1Institute of Graduate Studies, University of Malaya, 50603 Kuala Lumpur, Malaysia
2Centre for Foundation Studies in Science, University of Malaya, 50603 Kuala Lumpur, Malaysia
3Institute of Mathematical Sciences, University of Malaya, 50603 Kuala Lumpur, Malaysia
Received: 30 July 2013/Accepted: 13 February 2014
ABSTRACT
Missing values are a persistent problem in data analysis. Most analysts exclude the missing values from the analysis, which may lead to biased parameter estimates. This paper considers several imputation methods: a simulation study is conducted to compare three of them, namely mean substitution, hot deck and expectation maximization (EM) imputation. The EM imputation is found to be superior, especially when the percentage of missing values is high, as it consistently gives a lower RMSE than the other two methods. The EM imputation method is then applied to the PM10 concentration data sets, which contain missing values, for the southwest and northeast monsoons in Petaling Jaya and Seberang Perai, Malaysia. Four distributions, namely the Weibull, lognormal, gamma and Gumbel distributions, are considered to describe the PM10 concentrations. The Weibull distribution gives the best fit for the southwest monsoon data for Petaling Jaya. The lognormal distribution outperforms the others in describing the southwest monsoon data in Seberang Perai. Meanwhile, for the northeast monsoon in both locations, the gamma distribution best describes the data.
Keywords: Expectation maximization; mean imputation; missing value; PM10; Weibull
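The comparison summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' simulation design: the data are synthetic (not PM10 measurements), the EM step assumes a bivariate normal model with a fully observed auxiliary variable `x`, and all function names and parameter values (`simulate`, `em_bivariate`, the slope 0.8, and so on) are hypothetical choices made for illustration. Values are deleted completely at random, each method fills them in, and the RMSE is computed on the deleted entries only.

```python
import math
import random


def rmse(true_vals, imputed_vals):
    """Root mean square error between true and imputed values."""
    n = len(true_vals)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals)) / n)


def mean_substitution(y):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in y if v is not None]
    mu = sum(observed) / len(observed)
    return [mu if v is None else v for v in y]


def hot_deck(y, rng):
    """Replace each missing value with a randomly chosen observed value (donor)."""
    observed = [v for v in y if v is not None]
    return [rng.choice(observed) if v is None else v for v in y]


def em_bivariate(x, y, n_iter=50):
    """EM under an assumed bivariate normal model (x complete, y partly missing).

    Returns y with each missing entry replaced by its conditional mean given x.
    """
    n = len(x)
    mx = sum(x) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    observed = [v for v in y if v is not None]
    my = sum(observed) / len(observed)
    syy = sum((v - my) ** 2 for v in observed) / len(observed)
    sxy = 0.0  # start from "no association" and let EM update it
    for _ in range(n_iter):
        beta = sxy / sxx
        resid_var = syy - beta * sxy
        # E-step: expected y and y^2 for the missing entries
        ey = [v if v is not None else my + beta * (a - mx) for a, v in zip(x, y)]
        ey2 = [v * v if v is not None else e * e + resid_var for v, e in zip(y, ey)]
        # M-step: update mean, variance and covariance from expected statistics
        my = sum(ey) / n
        syy = sum(ey2) / n - my ** 2
        sxy = sum(a * e for a, e in zip(x, ey)) / n - mx * my
    beta = sxy / sxx
    return [v if v is not None else my + beta * (a - mx) for a, v in zip(x, y)]


def simulate(missing_frac=0.3, n=500, seed=42):
    """Delete a fraction of y at random and return the RMSE of each method."""
    rng = random.Random(seed)
    x = [rng.gauss(50, 10) for _ in range(n)]
    y_true = [10 + 0.8 * a + rng.gauss(0, 4) for a in x]  # synthetic data
    missing = sorted(rng.sample(range(n), int(missing_frac * n)))
    missing_set = set(missing)
    y = [None if i in missing_set else v for i, v in enumerate(y_true)]
    truth = [y_true[i] for i in missing]
    filled = {
        "mean": mean_substitution(y),
        "hot_deck": hot_deck(y, rng),
        "em": em_bivariate(x, y),
    }
    return {name: rmse(truth, [vals[i] for i in missing]) for name, vals in filled.items()}


if __name__ == "__main__":
    for frac in (0.1, 0.3, 0.5):
        print(frac, simulate(missing_frac=frac))
```

In this sketch the EM-based imputation benefits from the correlated auxiliary variable, which is one plausible reason for a lower RMSE; whether the same mechanism explains the paper's result depends on the data and model the authors used.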
ABSTRAK
Nilai hilang selalu menjadi masalah dalam analisis. Kebanyakan mengabaikan nilai hilang ini daripada analisis yang mungkin menyebabkan kepincangan dalam anggaran parameter. Beberapa kaedah gantian dipertimbangkan dalam kertas kerja ini dengan kaedah simulasi telah dijalankan untuk membandingkan kaedah-kaedah gantian tersebut iaitu penggantian menggunakan min, geladak panas dan jangkaan pemaksimuman (EM). Gantian EM didapati yang terbaik terutama apabila peratus nilai hilang adalah tinggi kerana ia berterusan memberi RMSE yang rendah berbanding dua kaedah yang lain. Kaedah gantian EM ini kemudiannya diaplikasikan pada set data kepekatan PM10 bagi monsun barat daya dan timur laut di Petaling Jaya dan Seberang Perai, Malaysia yang mempunyai nilai hilang. Empat jenis taburan, iaitu taburan Weibull, lognormal, gama dan Gumbel dipertimbangkan untuk menggambarkan kepekatan-kepekatan PM10. Taburan Weibull memberi kesesuaian terbaik untuk data monsun barat daya bagi Petaling Jaya. Taburan lognormal pula mengatasi yang lain dalam menggambarkan monsun barat daya di Seberang Perai. Manakala bagi monsun timur laut di kedua-dua kawasan, taburan gama adalah taburan yang terbaik yang menggambarkan data tersebut.
Kata kunci: Jangkaan pemaksimuman; min gantian; nilai hilang; PM10; Weibull
*Corresponding author; email: yzulina@um.edu.my