Sains Malaysiana 49(2)(2020): 447-459
http://dx.doi.org/10.17576/jsm-2020-4902-24
Ensemble Learning for Multidimensional Poverty Classification
(Pembelajaran Ensembel untuk Pengelasan Kemiskinan Pelbagai Dimensi)
AZURALIZA ABU BAKAR*, RUSNITA HAMDAN & NOR SAMSIAH SANI
Center for Artificial Intelligence Technology, Faculty of Information Science & Technology, 46300 UKM Bangi, Selangor Darul Ehsan, Malaysia
Received: 13 March 2019/Accepted: 10 November 2019
ABSTRACT
The poverty rate in Malaysia is determined through financial or income indices and measurements. Periodic measurements are conducted through the Household Expenditure and Income Survey (HEIS) twice every five years and are subsequently used to generate a Poverty Line Income (PLI), which determines poverty levels through statistical methods. Such uni-dimensional measurement, however, is unable to portray overall deprivation conditions, especially as experienced by the urban population. In addition, the United Nations Development Programme (UNDP) has introduced a set of multidimensional poverty measurements, but these have yet to be applied in Malaysia. The use of Machine Learning (ML) approaches to produce new poverty measurement methods is therefore of interest, and is made possible by the existence of rich poverty databases such as the eKasih database maintained by the Malaysian Government. The goal of this study was to determine whether an ensemble learning method (random forest) can classify poverty, and hence produce multidimensional poverty indicators, compared with base learner methods using the eKasih dataset. The CRoss Industry Standard Process for Data Mining (CRISP-DM) methodology was used to ensure that the data mining and ML processes were conducted properly. Besides Random Forest, we also examined decision tree and general linear methods to benchmark their performance and determine the method with the highest accuracy. Fifteen variables were then ranked using the varImp method to identify the important variables. The analysis showed that Per Capita Income, State, Ethnic, Strata, Religion, Occupation and Education were the most important variables in the classification of poverty, with 99% accuracy using the Random Forest algorithm.
Keywords: Machine learning; multidimensional poverty; random forest
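The ensemble idea summarised in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it uses bagged decision stumps (a simplified stand-in for random forest, without random feature sub-sampling) and permutation importance as a rough analogue of the varImp ranking. The data are synthetic, and the variable names `income` and `noise`, the threshold of 800, and all function names are assumptions made only for illustration; nothing here comes from the eKasih dataset.

```python
import random

def make_data(n, rng):
    """Synthetic households: 'income' drives the label, 'noise' does not."""
    rows = []
    for _ in range(n):
        income = rng.uniform(0, 2000)      # hypothetical per-capita income
        noise = rng.uniform(0, 1)          # deliberately uninformative variable
        label = 1 if income < 800 else 0   # 1 = poor (made-up threshold)
        rows.append(((income, noise), label))
    return rows

def fit_stump(rows):
    """Return the (feature, threshold, polarity) split with the fewest errors."""
    best = None
    for f in range(len(rows[0][0])):
        for t in sorted({x[f] for x, _ in rows}):
            # Errors when predicting 1 for x[f] < t and 0 otherwise.
            err = sum((1 if x[f] < t else 0) != y for x, y in rows)
            # polarity 0 flips the prediction, so its error is the complement.
            for polarity, e in ((1, err), (0, len(rows) - err)):
                if best is None or e < best[0]:
                    best = (e, f, t, polarity)
    return best[1:]

def predict_stump(stump, x):
    f, t, polarity = stump
    return polarity if x[f] < t else 1 - polarity

def fit_bagging(rows, n_estimators, rng):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    stumps = []
    for _ in range(n_estimators):
        sample = [rng.choice(rows) for _ in rows]
        stumps.append(fit_stump(sample))
    return stumps

def predict(stumps, x):
    """Majority vote over all stumps."""
    votes = sum(predict_stump(s, x) for s in stumps)
    return 1 if votes * 2 > len(stumps) else 0

def accuracy(stumps, rows):
    return sum(predict(stumps, x) == y for x, y in rows) / len(rows)

def permutation_importance(stumps, rows, feature, rng):
    """Drop in accuracy after shuffling one feature column."""
    base = accuracy(stumps, rows)
    column = [x[feature] for x, _ in rows]
    rng.shuffle(column)
    shuffled = [(x[:feature] + (v,) + x[feature + 1:], y)
                for (x, y), v in zip(rows, column)]
    return base - accuracy(stumps, shuffled)

rng = random.Random(0)
train, test = make_data(200, rng), make_data(200, rng)
model = fit_bagging(train, 25, rng)
test_accuracy = accuracy(model, test)
importance = {name: permutation_importance(model, test, i, rng)
              for i, name in enumerate(("income", "noise"))}
print("test accuracy:", test_accuracy)
print("importance:", importance)
```

In this toy setting the informative variable should rank well above the uninformative one, which mirrors the purpose of the variable ranking described above. A real analysis would use a full random forest implementation and the actual survey variables.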
ABSTRACT (translated from Malay)
The poverty rate in Malaysia is determined through measurement from a financial or income perspective. Periodic measurements are carried out through the Household Expenditure and Income Survey (HEIS) once every two years and are used to produce the Poverty Line Income (PLI), which determines poverty levels using statistical methods. This uni-dimensional measurement, however, cannot portray the overall state of deprivation, which is experienced especially by the urban population. The United Nations Development Programme (UNDP) has introduced a multidimensional poverty measurement method that has not yet been used in Malaysia. The potential for using Machine Learning (ML) approaches to produce new poverty measurement methods is therefore high, owing to the availability of major poverty database collections such as the eKasih database managed by the Government of Malaysia. The aim of this study is to show that an ensemble machine learning method (random forest) can classify poverty with high accuracy and can list multidimensional poverty indicators, in comparison with base learning methods, using the eKasih dataset. The CRoss Industry Standard Process for Data Mining (CRISP-DM) method was used to ensure that the data mining and ML processes were carried out properly. Besides Random Forest, we also examined decision tree and general linear methods to benchmark their performance and determine the best method with the highest accuracy. Fifteen variables were ranked using the varImp method to find the important variables. The analysis shows that Per Capita Income, State, Ethnic, Strata, Religion, Occupation and Education were the most important factors in classifying poverty, at a 99% accuracy level using the random forest algorithm.
Keywords: Random forest; multidimensional poverty; machine learning
*Corresponding author; email: azu1328@yahoo.com