Sains Malaysiana 46(6)(2017): 1001–1010
http://dx.doi.org/10.17576/jsm-2017-4606-20
New Discrimination Procedure of Location Model for Handling Large Categorical Variables
(Prosedur Diskriminasi Baharu Model Lokasi untuk Mengendalikan Pemboleh Ubah Kategori Besar)
HASHIBAH HAMID*, LONG MEI MEI & SHARIPAH SOAAD SYED YAHAYA
Statistics Department, School of Quantitative Sciences, Universiti Utara Malaysia College of Arts and Sciences, 06010 UUM Sintok, Kedah Darul Aman, Malaysia
Received: 16 October 2015/Accepted: 28 November 2016
ABSTRACT
The location model proposed in earlier work is a predictive discriminant rule that classifies new observations into one of two predefined groups based on mixtures of continuous and categorical variables. Its ability to classify a new observation correctly depends heavily on the number of multinomial cells created by the categorical variables. This study first conducts a preliminary investigation showing that the location model fitted by maximum likelihood estimation has a high misclassification rate, up to 45% on average, when there are more than six categorical variables, for all 36 datasets tested. The model therefore predicts poorly when the number of categorical variables is large, even with a large sample size. To reduce this high misclassification rate, a new strategy is embedded in the discriminant rule by introducing nonlinear principal component analysis (NPCA) into the classical location model (cLM), mainly to handle the large number of categorical variables. The strategy is investigated on simulated and real datasets, with the misclassification rate estimated by the leave-one-out method. The numerical results demonstrate the feasibility of the proposed model: its misclassification rate is dramatically lower than that of the cLM in all 18 data settings. A practical application to a real dataset shows a significant improvement, with results comparable to the best of the methods compared. Overall, the findings show that the proposed model extends the applicability of the location model, which was previously limited to at most six categorical variables for acceptable performance. This study shows that the proposed model, with its new discrimination procedure, can serve as an alternative for mixed-variable classification problems, primarily when the number of categorical variables is large.
Keywords: Large categorical variables; leave-one-out method; location model; nonlinear principal component analysis; misclassification rate
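The dependence on the number of multinomial cells can be illustrated with a minimal sketch. It assumes the categorical variables are binary, so that b variables define 2^b cells per group, and it uses the conventional cell index m = 1 + sum of x_q 2^(q-1). The abstract does not state these formulas, and all function names below are hypothetical. The leave-one-out estimator is written generically; the `fit` and `predict` callables stand in for any discriminant rule and are not the authors' implementation.

```python
def cell_count(num_binary):
    """Number of multinomial cells formed by `num_binary` binary variables."""
    return 2 ** num_binary


def cell_index(binary_values):
    """Map a vector of 0/1 values to a cell index in 1..2^b.

    Assumes the conventional indexing m = 1 + sum_q x_q * 2^(q-1).
    """
    return 1 + sum(x * 2 ** q for q, x in enumerate(binary_values))


def mean_obs_per_cell(sample_size, num_binary):
    """Average number of observations available per cell in one group."""
    return sample_size / cell_count(num_binary)


def leave_one_out_error(samples, labels, fit, predict):
    """Estimate the misclassification rate by leave-one-out.

    Each observation is held out in turn, the rule is fitted on the
    remaining observations, and the held-out observation is classified.
    `fit` and `predict` are placeholders for any classification rule.
    """
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)
```

Under these assumptions, six binary variables give 64 cells per group and ten give 1024, so a group of 128 observations averages two observations per cell in the first case and far fewer than one in the second. This is the sparsity that makes per-cell maximum likelihood estimates unreliable.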
ABSTRAK
Model lokasi yang dicadangkan pada masa lalu adalah satu peraturan diskriminan ramalan yang boleh mengelaskan cerapan baharu ke dalam salah satu daripada dua kumpulan yang telah ditetapkan berdasarkan campuran pemboleh ubah selanjar dan kategori. Keupayaan model lokasi untuk mendiskriminasi cerapan baharu dengan betul adalah amat bergantung kepada bilangan sel-sel multinomial yang dicipta melalui bilangan pemboleh ubah kategori. Penyelidikan ini menjalankan suatu kajian awal untuk menunjukkan model lokasi yang menggunakan anggaran kebolehjadian maksimum mempunyai kadar silap pengelasan yang tinggi sehingga 45% secara purata dalam berurusan dengan lebih daripada enam pemboleh ubah kategori bagi kesemua 36 data yang diuji. Model tersebut menunjukkan ramalan tidak tepat yang sangat tinggi kerana model ini berprestasi teruk bagi pemboleh ubah kategori besar walaupun dengan saiz sampel yang besar. Untuk mengurangkan kadar kesilapan pengelasan yang tinggi, satu strategi baharu telah diterapkan dalam peraturan diskriminan dengan memperkenalkan analisis komponen utama tak linear (NPCA) ke dalam model lokasi klasik (cLM), terutamanya untuk mengendalikan bilangan besar pemboleh ubah kategori. Strategi baharu ini dikaji pada beberapa set data simulasi dan sebenar melalui anggaran kadar silap pengelasan menggunakan kaedah leave-one-out. Hasil daripada kajian berangka menampakkan kebolehlaksanaan model yang dicadangkan dengan kadar silap pengelasan menurun secara mendadak berbanding dengan cLM untuk kesemua 18 tetapan data yang berbeza. Aplikasi amali menggunakan set data sebenar menunjukkan penambahbaikan yang signifikan dan mendapat keputusan yang setanding dalam kalangan kaedah terbaik yang dibandingkan. Hasil kajian secara keseluruhan menunjukkan bahawa model yang dicadangkan memperluaskan rangkaian kebolehgunaan model lokasi kerana sebelum ini ia telah dihadkan kepada hanya enam pemboleh ubah kategori untuk mencapai prestasi yang boleh diterima. Kajian ini membuktikan bahawa model yang dicadangkan dengan prosedur diskriminasi yang baharu boleh digunakan sebagai alternatif kepada masalah klasifikasi pemboleh ubah campuran, terutamanya apabila berhadapan dengan pemboleh ubah kategori besar.
Kata kunci: Analisis komponen utama tak linear; kadar silap pengelasan; kaedah leave-one-out; model lokasi; pemboleh ubah kategori besar
*Corresponding author; email: hashibah@uum.edu.my