Sains Malaysiana 52(5)(2023): 1595-1606
http://doi.org/10.17576/jsm-2023-5205-20
Identifying Multiple Outliers in Linear Functional Relationship Model Using a Robust Clustering Method
(Menentukan Data Terpencil Berganda bagi Model Linear Hubungan Fungsian Menggunakan Kaedah Berkelompok yang Lebih Kukuh)
ADILAH ABDUL GHAPOR1,*, YONG ZULINA ZUBAIRI2, SAYED MD. AL MAMUN3, SITI FATIMAH HASSAN4, ELAYARAJA ARUCHUNAN5 & NURKHAIRANY AMYRA MOKHTAR6
1Department of Decision Science, Faculty of Business and Economics, Universiti Malaya, 50603 Kuala Lumpur, Federal Territory, Malaysia
2Institute of Advanced Studies, Universiti Malaya, 50603 Kuala Lumpur, Federal Territory, Malaysia
3Department of Statistics, University of Rajshahi, Bangladesh
4Centre for Foundation Studies in Science, Universiti Malaya, Kuala Lumpur, Malaysia
5Institute of Mathematical Sciences, Faculty of Science, Universiti Malaya, 50603 Kuala Lumpur, Federal Territory, Malaysia
6Mathematical Sciences Studies, College of Computing, Informatics and Media, Universiti Teknologi MARA, 85000 Segamat, Johor Darul Takzim, Malaysia
Received: 12 October 2022/Accepted: 10 May 2023
Abstract
Outliers are observations that lie outside the usual pattern of the other observations. Detecting them is essential because anomalous observations can affect the inferences drawn from an analysis. In this study, we propose an efficient clustering procedure to identify multiple outliers in the linear functional relationship model, using the single linkage algorithm with the Euclidean distance as the similarity measure. We also propose a new robust cut-off point, based on the median and the median absolute deviation of the tree heights, to classify the potential outliers. Results from the simulation study suggest that the proposed method can identify multiple outliers with a very small probability of swamping and masking. An application to real data also shows that the proposed clustering method for the linear functional relationship model successfully detects the outliers, which suggests that the method is practical for real-world problems.
Keywords: Clustering; linear; measurement error; multiple outliers
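
The procedure summarised in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not settle: the raw (x, y) pairs are clustered directly, the cut-off takes the form median + c × MAD of the tree heights with a hypothetical multiplier c = 3, and every observation outside the largest cluster is flagged as a potential outlier. The function names are illustrative only. The sketch relies on the known fact that single-linkage merge heights coincide with the edge lengths of a minimum spanning tree, so no clustering library is needed.

```python
import math
import statistics


def single_linkage_edges(points):
    """Return the n-1 single-linkage merges as (height, i, j) tuples.

    Single-linkage merge heights equal the edge lengths of a minimum
    spanning tree under the same distance, so Prim's algorithm is used.
    """
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n  # shortest Euclidean distance to the tree so far
    best_from = [-1] * n        # tree vertex that realises that distance
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_dist[u], best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def flag_outliers(points, c=3.0):
    """Cut the single-linkage tree at median + c * MAD of its heights.

    ASSUMPTION: the cut-off form and c = 3 are illustrative choices,
    not values taken from the paper.
    """
    edges = single_linkage_edges(points)
    heights = [h for h, _, _ in edges]
    med = statistics.median(heights)
    mad = statistics.median([abs(h - med) for h in heights])
    cutoff = med + c * mad

    # Union-find: apply only the merges at or below the cut-off height.
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for h, i, j in edges:
        if h <= cutoff:
            parent[find(i)] = find(j)

    labels = [find(i) for i in range(len(points))]
    main = max(set(labels), key=labels.count)  # largest cluster = bulk of data
    flagged = [i for i, lab in enumerate(labels) if lab != main]
    return flagged, cutoff


# Ten points scattered around the line y = x, plus two planted outliers.
points = [(i, i + (0.1 if i % 2 == 0 else -0.1)) for i in range(10)]
points += [(4, 12), (5, 13)]
flagged, cutoff = flag_outliers(points)
print(flagged, round(cutoff, 3))
```

In this toy data set the two planted points lie close to each other but far from the rest, so they merge with the main group only at a large tree height. Because the cut-off is built from the median and MAD, that one large height barely moves it, which is the motivation for a robust cut-off when several outliers are present.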
Abstrak (English translation)
Outliers are observations that lie outside the pattern of the other observations. Identifying outliers is important because unusual observations can influence the inference drawn from the analysis. In this study, we propose a more robust clustering method to identify multiple outliers in the linear functional relationship model (LFRM), using the single linkage algorithm with the Euclidean distance as the similarity measure. A robust cut-off value, based on the median and the median absolute deviation of the tree heights, is proposed to group the multiple outliers. Experimental results based on simulation show that the proposed method successfully detects multiple outliers in a data set and performs well, with low 'masking' and 'swamping' values. An application to real data also shows that the proposed clustering method for the linear functional relationship model (LFRM) successfully identifies outliers; the method is therefore recommended for applications to real-world data.
Keywords: Clustering; measurement error; linear; multiple outliers
*Corresponding author; email: adilahghapor@gmail.com