Sains Malaysiana 46(6)(2017): 1001–1010
http://dx.doi.org/10.17576/jsm-2017-4606-20
New Discrimination Procedure of Location Model for Handling Large Categorical Variables
(Prosedur Diskriminasi Baharu Model Lokasi untuk Mengendalikan Pemboleh Ubah Kategori Besar)
HASHIBAH HAMID*, LONG MEI MEI & SHARIPAH SOAAD SYED YAHAYA
Statistics Department, School of Quantitative Sciences, Universiti Utara Malaysia College of Arts and Sciences,
06010 UUM Sintok, Kedah Darul Aman, Malaysia
Received: 16 October 2015/Accepted: 28 November 2016
ABSTRACT
The location model, proposed in earlier work, is a predictive discriminant rule that classifies new observations into one of two predefined groups based on a mixture of continuous and categorical variables. Its ability to classify new observations correctly depends heavily on the number of multinomial cells created by the categorical variables. This study conducts a preliminary investigation showing that the location model based on maximum likelihood estimation has a high misclassification rate, up to 45% on average, when dealing with more than six categorical variables, across all 36 data sets tested. The model therefore predicts poorly when the number of categorical variables is large, even with a large sample size. To reduce this high misclassification rate, a new strategy is embedded in the discriminant rule by introducing nonlinear principal component analysis (NPCA) into the classical location model (cLM), mainly to handle the large number of categorical variables. The new strategy is investigated on simulated and real data sets, with the misclassification rate estimated by the leave-one-out method. The numerical results demonstrate the feasibility of the proposed model: its misclassification rate is dramatically lower than that of the cLM in all 18 data settings. A practical application to a real data set shows a significant improvement, with results comparable to the best of the methods compared. Overall, the findings reveal that the proposed model extends the range of applicability of the location model, which was previously limited to only six categorical variables for acceptable performance. This study shows that the proposed model, with its new discrimination procedure, can be used as an alternative for mixed-variable classification problems, primarily when the number of categorical variables is large.
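The dependence on the number of multinomial cells, and the leave-one-out estimate of the misclassification rate, can be illustrated with a minimal Python sketch. It assumes that every categorical variable is binary, so that b variables create 2^b cells per group. All function names and the toy nearest-mean classifier are hypothetical stand-ins chosen for illustration; they are not the authors' implementation of the location model or of NPCA.

```python
from typing import Callable, List, Sequence


def multinomial_cells(n_binary: int) -> int:
    """Cells created by n_binary variables (assumption: all are binary)."""
    return 2 ** n_binary


def mean_obs_per_cell(n_obs: int, n_binary: int) -> float:
    """Average number of observations per cell within one group."""
    return n_obs / multinomial_cells(n_binary)


def nearest_mean_predict(train_x: List[float], train_y: List[int],
                         new_x: float) -> int:
    """Toy classifier (NOT the location model): assign new_x to the
    group whose training mean is closest."""
    means = {}
    for label in set(train_y):
        values = [x for x, y in zip(train_x, train_y) if y == label]
        means[label] = sum(values) / len(values)
    return min(means, key=lambda label: abs(new_x - means[label]))


def leave_one_out_error(xs: Sequence[float], ys: Sequence[int],
                        predict: Callable[[List[float], List[int], float], int]
                        ) -> float:
    """Hold out each observation once, train on the rest, and return
    the proportion of held-out observations that are misclassified."""
    xs, ys = list(xs), list(ys)
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)


if __name__ == "__main__":
    # Cell counts grow exponentially with the number of binary variables.
    for b in (2, 6, 10):
        print(b, multinomial_cells(b), mean_obs_per_cell(100, b))

    # Leave-one-out error of the toy classifier on made-up data.
    xs = [0.0, 0.2, 0.1, 1.0, 1.2, 1.1]
    ys = [0, 0, 0, 1, 1, 1]
    print(leave_one_out_error(xs, ys, nearest_mean_predict))
```

Under the binary assumption, six variables already give 64 cells per group and ten give 1024, so a group of 100 observations cannot fill every cell once the number of variables passes six. This is one way to read the abstract's remark that performance degrades even with a large sample size.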
Keywords: Large categorical variables; leave-one-out method; location model; nonlinear principal component analysis; misclassification rate
ABSTRAK
Model lokasi yang dicadangkan pada masa lalu adalah satu peraturan diskriminan ramalan yang boleh mengelaskan cerapan baharu ke dalam salah satu daripada dua kumpulan yang telah ditetapkan berdasarkan campuran pemboleh ubah selanjar dan kategori. Keupayaan model lokasi untuk mendiskriminasi cerapan baharu dengan betul adalah amat bergantung kepada bilangan sel-sel multinomial yang dicipta melalui bilangan pemboleh ubah kategori. Penyelidikan ini menjalankan suatu kajian awal untuk menunjukkan model lokasi yang menggunakan anggaran kebolehjadian maksimum mempunyai kadar silap pengelasan yang tinggi sehingga 45% secara purata dalam berurusan dengan lebih daripada enam pemboleh ubah kategori bagi kesemua 36 data yang diuji. Model tersebut menunjukkan ramalan tidak tepat yang sangat tinggi kerana model ini berprestasi teruk bagi pemboleh ubah kategori besar walaupun dengan saiz sampel yang besar. Untuk mengurangkan kadar kesilapan pengelasan yang tinggi, satu strategi baharu telah diterapkan dalam peraturan diskriminan dengan memperkenalkan analisis komponen utama tak linear (NPCA) ke dalam model lokasi klasik (cLM), terutamanya untuk mengendalikan bilangan besar pemboleh ubah kategori. Strategi baharu ini dikaji pada beberapa set data simulasi dan sebenar melalui anggaran kadar silap pengelasan menggunakan kaedah leave-one-out. Hasil daripada kajian berangka menampakkan kebolehlaksanaan model yang dicadangkan dengan kadar silap pengelasan menurun secara mendadak berbanding dengan cLM untuk kesemua 18 tetapan data yang berbeza. Aplikasi amali menggunakan set data sebenar menunjukkan penambahbaikan yang signifikan dan mendapat keputusan yang setanding dalam kalangan kaedah terbaik yang dibandingkan. Hasil kajian secara keseluruhan menunjukkan bahawa model yang dicadangkan memperluaskan rangkaian kebolehgunaan model lokasi kerana sebelum ini ia telah dihadkan kepada hanya enam pemboleh ubah kategori untuk mencapai prestasi yang boleh diterima. Kajian ini membuktikan bahawa model yang dicadangkan dengan prosedur diskriminasi yang baharu boleh digunakan sebagai alternatif kepada masalah klasifikasi pemboleh ubah campuran, terutamanya apabila berhadapan dengan pemboleh ubah kategori besar.
Kata kunci: Analisis komponen utama tak linear; kadar silap pengelasan; kaedah leave-one-out; model lokasi; pemboleh ubah kategori besar
*Corresponding author; email: hashibah@uum.edu.my