Sains Malaysiana 46(9)(2017): 1449–1455
http://dx.doi.org/10.17576/jsm-2017-4609-13
Parallelization of Logic Regression Analysis on SNP-SNP Interactions of a Crohn’s Disease Dataset Model
UNITSA SANGKET1*, SURAKAMETH MAHASIRIMONGKOL2, PICHAYA TANDAYYA3, SURASAK SANGKHATHAT4, WASUN CHANTRATITA5, QI LIU6 & YUTAKA YASUI6
1Department of Molecular Biotechnology and Bioinformatics, Center for Genomics and Bioinformatics Research, Faculty of Science, Prince of Songkla University, Songkhla, 90112 Thailand
2Medical Genetic Section, National Institute of Health, Department of Medical Sciences, Ministry of Public Health, Nonthaburi, 11000 Thailand
3Department of Computer Engineering, Faculty of Engineering, Prince of Songkla University, Songkhla, 90112 Thailand
4Tumor Biology Research Unit, Department of Surgery, Faculty of Medicine, Prince of Songkla University, Songkhla, 90112 Thailand
5Department of Pathology, Faculty of Medicine, Ramathibodi Hospital, Mahidol University, Bangkok, 10400 Thailand
6Department of Public Health Sciences, School of Public Health, University of Alberta, Edmonton, Alberta, Canada
Received: 31 August 2016/Accepted: 17 January 2017
ABSTRACT
SNP-SNP interactions are recognized as fundamentally important for understanding the genetic causes of complex disease traits. Logic regression is an effective method for identifying SNP-SNP interactions associated with the risk of complex disease. However, identifying SNP-SNP interactions is computationally challenging and may take hours, weeks or even months to complete. Although parallel computing is a powerful way to reduce computing time, it is difficult for users to apply it to logic regression analyses of SNP-SNP interactions, because doing so requires advanced programming skills to correctly partition and distribute data, control and monitor tasks across multi-core CPUs or several computers, and merge output files. In this paper, we present a novel R library called SNPInt that automatically speeds up analyses of SNP-SNP interactions in genome-wide association (GWA) studies using parallel computing, without requiring advanced programming skills. To evaluate the performance of SNPInt, the Crohn’s disease GWA study dataset from the Wellcome Trust Case Control Consortium (WTCCC), which includes genotypes of 500,000 SNPs for 4,680 individuals, was analyzed using logic regression on a computer cluster. The results from SNPInt with any number of CPUs were the same as those from the non-parallel approach, and the SNPInt library substantially accelerated the logic regression analysis. For instance, with two hundred genes and twenty permutation rounds, the computing time decreased from 7.3 days to only 0.9 days when SNPInt used eight CPUs. Running analyses of SNP-SNP interactions with the SNPInt library is an effective way to boost performance and simplify the parallelization of such analyses.
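
The partition, distribute and merge steps that the abstract says users would otherwise have to program by hand can be illustrated with a minimal Python sketch. It is not the SNPInt implementation, which is an R library whose internals are not described here. All function names (`score_task`, `make_tasks`, `partition`, `run_parallel`) are hypothetical, the per-task "score" is a seeded random number standing in for a real logic regression fit, and treating round 0 as the unpermuted data is an assumption. The sketch shows one possible way to make the merged output independent of the number of workers: each (gene, permutation round) task derives its random seed from its own identifiers rather than from the worker that runs it.

```python
import concurrent.futures as cf
import random


def score_task(task):
    """Toy stand-in for one logic regression fit (NOT the real algorithm)."""
    gene_id, perm_round, base_seed = task
    # The seed depends only on the task itself, never on the worker running it,
    # so the score is reproducible for any number of workers.
    rng = random.Random(base_seed * 1_000_003 + gene_id * 1_009 + perm_round)
    return gene_id, perm_round, rng.random()


def make_tasks(n_genes, n_perms, base_seed=1):
    """One task per (gene, round); round 0 is assumed to be the observed data."""
    return [(g, p, base_seed) for g in range(n_genes) for p in range(n_perms + 1)]


def partition(tasks, n_workers):
    """Deal the tasks round-robin into n_workers chunks."""
    return [tasks[i::n_workers] for i in range(n_workers)]


def run_chunk(chunk):
    """Work done by a single worker: score every task in its chunk."""
    return [score_task(t) for t in chunk]


def run_parallel(tasks, n_workers, executor_cls=cf.ProcessPoolExecutor):
    """Partition the tasks, distribute the chunks, then merge the outputs."""
    chunks = partition(tasks, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        partial_results = list(pool.map(run_chunk, chunks))
    merged = [row for part in partial_results for row in part]
    # Sorting makes the merged output independent of how tasks were dealt out.
    return sorted(merged)
```

Calling `run_parallel(make_tasks(200, 20), 8)` would mirror the shape of the example in the abstract (two hundred genes, twenty permutation rounds, eight CPUs), although the toy scores carry no biological meaning. On platforms that start worker processes by spawning, the call should sit under an `if __name__ == "__main__":` guard.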
Keywords: Crohn’s disease GWA studies; logic regression; parallel computing; R; SNP-SNP interactions
*Corresponding author; email: unitsa.s@psu.ac.th