Sains Malaysiana 46(9)(2017): 1449–1455
http://dx.doi.org/10.17576/jsm-2017-4609-13
Parallelization of Logic Regression Analysis on SNP-SNP
Interactions of a Crohn’s Disease Dataset Model
(Analisis Regresi Logik Keselarian pada Interaksi
SNP-SNP Model Dataset Penyakit Crohn)
UNITSA SANGKET1*, SURAKAMETH MAHASIRIMONGKOL2, PICHAYA TANDAYYA3, SURASAK SANGKHATHAT4, WASUN CHANTRATITA5, QI LIU6 & YUTAKA YASUI6
1Department of Molecular Biotechnology and Bioinformatics, Center for Genomics and Bioinformatics Research, Faculty of Science, Prince of Songkla University, Songkhla, 90112 Thailand
2Medical Genetic Section, National Institute of Health, Department of Medical Sciences, Ministry of Public Health, Nonthaburi, 11000 Thailand
3Department of Computer Engineering, Faculty of Engineering, Prince of Songkla University, Songkhla, 90112 Thailand
4Tumor Biology Research Unit, Department of Surgery, Faculty of Medicine, Prince of Songkla University, Songkhla, 90112 Thailand
5Department of Pathology, Faculty of Medicine, Ramathibodhi Hospital, Mahidol University, Bangkok, 10400 Thailand
6Department of Public Health Sciences, School of Public Health, University of Alberta, Edmonton, Alberta, Canada
Received: 31 August 2016/Accepted: 17 January 2017
ABSTRACT
SNP-SNP interactions are recognized as fundamentally important for understanding the genetic causes
of complex disease traits. Logic regression is an effective method for
identifying SNP-SNP interactions associated with the risk of complex
disease. However, identifying SNP-SNP interactions is
computationally challenging and may take hours, weeks or even months to complete.
Although parallel computing is a powerful way to reduce computing time,
it is arduous for users to apply to logic regression analyses of SNP-SNP interactions,
because it requires advanced programming skills to
correctly partition and distribute data, control and monitor tasks across
multi-core CPUs or several computers, and merge output files. In
this paper, we present a novel R library called SNPInt
that automatically speeds up analyses of SNP-SNP interactions in
genome-wide association (GWA) studies using parallel computing,
without requiring advanced programming skills. To evaluate SNPInt performance,
the Crohn’s disease GWA study dataset from the Wellcome Trust Case Control Consortium (WTCCC),
which includes genotypes of 500,000 SNPs for 4,680 individuals,
was analyzed using logic regression on a computer cluster.
The results from SNPInt with any number of CPUs
were the same as those from the non-parallel approach, and the SNPInt
library substantially accelerated the logic regression analysis. For instance, with two
hundred genes and twenty permutation rounds, the computing time
decreased steadily from 7.3 days to only 0.9 days when SNPInt
used eight CPUs. Running analyses of SNP-SNP interactions
with the SNPInt library is an effective way to boost performance
and to simplify their parallelization.
Keywords: Crohn’s disease GWA studies; logic regression; parallel computing; R; SNP-SNP interactions
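
The partition, distribute and merge pattern described in the abstract can be illustrated with a minimal Python sketch. It is not the SNPInt implementation, which is an R library. The names score_task, partition, run_chunk and run_parallel are hypothetical, the score is a seeded random placeholder rather than a logic regression fit, and a thread pool stands in for the multi-core CPUs or cluster nodes so that the example stays self-contained. Seeding each task from its own identifiers is one assumed way to make the merged output independent of the number of workers.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def score_task(task):
    """Placeholder for one logic regression fit (hypothetical stand-in).

    The seed depends only on the task itself, so the score does not
    depend on which worker runs the task or in what order.
    """
    gene_id, perm_round = task
    rng = random.Random(gene_id * 1000 + perm_round)
    return (gene_id, perm_round, rng.random())


def partition(tasks, n_parts):
    """Split tasks into n_parts contiguous chunks of nearly equal size."""
    size, extra = divmod(len(tasks), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(tasks[start:end])
        start = end
    return chunks


def run_chunk(chunk):
    """Work done by a single worker: score every task in its chunk."""
    return [score_task(t) for t in chunk]


def run_parallel(tasks, n_workers):
    """Partition the tasks, distribute chunks to a pool, merge the outputs."""
    chunks = partition(tasks, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the order of the input chunks.
        partial_results = list(pool.map(run_chunk, chunks))
    return [row for part in partial_results for row in part]


# 200 genes x 20 permutation rounds, matching the abstract's example sizes.
tasks = [(g, p) for g in range(200) for p in range(20)]
serial = run_chunk(tasks)
parallel = run_parallel(tasks, n_workers=8)
same_results = (serial == parallel)

# Speedup implied by the reported timings (7.3 days vs 0.9 days on 8 CPUs).
speedup = 7.3 / 0.9
```

Because the chunks are contiguous and merged in order, the parallel output matches the serial output for any worker count. The reported timings correspond to a speedup of roughly 8.1 on eight CPUs, computed from the rounded figures in the abstract.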
ABSTRAK
Interaksi SNP-SNP telah
diiktiraf penting pada dasarnya untuk memahami punca genetik sifat penyakit
kompleks. Regresi logik adalah satu kaedah yang berkesan untuk mengenal pasti
interaksi SNP-SNP yang dikaitkan dengan risiko penyakit kompleks.
Walau bagaimanapun, mengenal pasti interaksi SNP-SNP adalah
mencabar secara pengkomputeran dan mungkin mengambil masa berjam, berminggu dan
berbulan untuk diselesaikan. Walaupun pengkomputeran selari adalah satu kaedah
berkuasa untuk mempercepatkan masa pengiraan, ia adalah sukar bagi pengguna
untuk menggunakan kaedah ini dalam analisis regresi logik interaksi SNP-SNP kerana ia memerlukan kemahiran pengaturcaraan lanjutan untuk
pemetakan dan pengagihan data dengan betul, mengawal dan memantau tugas
pelbagai teras CPU atau beberapa komputer dan
menggabungkan fail output. Dalam kertas ini, kami memberikan R-perpustakaan novel
yang disebut SNPInt untuk secara automatik mempercepatkan analisis
interaksi SNP-SNP kajian sekutuan genom-menyeluruh (GWA)
menggunakan pengkomputeran selari tanpa kemahiran pengaturcaraan lanjutan.
Kajian dataset penyakit Crohn GWA daripada Wellcome Trust
Case Control Consortium (WTCCC) yang merangkumi 4,680 individu
dengan 500,000 SNP genotip telah dianalisis
menggunakan regresi logik pada kelompok komputer untuk menilai prestasi SNPInt.
Hasil daripada SNPInt dengan apa-apa bilangan CPU adalah
sama seperti hasil daripada pendekatan bebas-selari dan perpustakaan SNPInt
mempercepatkan analisis regresi logik. Sebagai contoh, dengan dua ratus gen dan
dua puluh pusingan permutasi, masa pengiraan berterusan menurun daripada 7.3
hari kepada 0.9 hari sahaja apabila SNPInt menggunakan lapan CPU.
Analisis pelaksanaan interaksi SNP-SNP menggunakan
perpustakaan SNPInt adalah merupakan satu cara yang berkesan untuk
meningkatkan prestasi dan memudahkan keselarian analisis interaksi SNP-SNP.
Kata
kunci: Interaksi SNP-SNP; kajian
penyakit Crohn GWA; pengiraan
selari; R; regresi logik
*Corresponding author; email: unitsa.s@psu.ac.th