SAINS MALAYSIANA

Sains Malaysiana 46(9)(2017): 1449–1455

http://dx.doi.org/10.17576/jsm-2017-4609-13

Parallelization of Logic Regression Analysis on SNP-SNP Interactions of a Crohn’s Disease Dataset Model

UNITSA SANGKET¹*, SURAKAMETH MAHASIRIMONGKOL², PICHAYA TANDAYYA³, SURASAK SANGKHATHAT⁴, WASUN CHANTRATITA⁵, QI LIU⁶ & YUTAKA YASUI⁶

¹Department of Molecular Biotechnology and Bioinformatics, Center for Genomics and Bioinformatics Research, Faculty of Science, Prince of Songkla University, Songkhla, 90112 Thailand

²Medical Genetic Section, National Institute of Health, Department of Medical Sciences, Ministry of Public Health, Nonthaburi, 11000 Thailand

³Department of Computer Engineering, Faculty of Engineering, Prince of Songkla University, Songkhla, 90112 Thailand

⁴Tumor Biology Research Unit, Department of Surgery, Faculty of Medicine, Prince of Songkla University, Songkhla, 90112 Thailand

⁵Department of Pathology, Faculty of Medicine, Ramathibodi Hospital, Mahidol University, Bangkok, 10400 Thailand

⁶Department of Public Health Sciences, School of Public Health, University of Alberta, Edmonton, Alberta, Canada

Received: 31 August 2016/Accepted: 17 January 2017

ABSTRACT

SNP-SNP interactions are recognized as fundamentally important for understanding the genetic causes of complex disease traits. Logic regression is an effective method for identifying SNP-SNP interactions associated with the risk of complex disease. However, identifying SNP-SNP interactions is computationally challenging and may take hours, weeks or even months to complete. Although parallel computing is a powerful way to reduce computing time, it is arduous for users to apply to logic regression analyses of SNP-SNP interactions, because it requires advanced programming skills to correctly partition and distribute data, control and monitor tasks across multi-core CPUs or several computers, and merge output files. In this paper, we present a novel R library called SNPInt that automatically speeds up analyses of SNP-SNP interactions in genome-wide association (GWA) studies using parallel computing, without requiring advanced programming skills. To evaluate the performance of SNPInt, the Crohn’s disease GWA study dataset from the Wellcome Trust Case Control Consortium (WTCCC), which includes genotypes of 500,000 SNPs for 4,680 individuals, was analyzed using logic regression on a computer cluster. SNPInt produced the same results as the non-parallel approach regardless of the number of CPUs used, and it substantially accelerated the logic regression analysis. For instance, with two hundred genes and twenty permutation rounds, the computing time decreased from 7.3 days to only 0.9 days when SNPInt used eight CPUs. Executing analyses of SNP-SNP interactions with the SNPInt library is an effective way to boost performance and to simplify the parallelization of such analyses.

Keywords: Crohn’s disease GWA studies; logic regression; parallel computing; R; SNP-SNP interactions
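
The partition, distribute and merge steps named in the abstract can be illustrated with a minimal Python sketch. It is not the SNPInt API: the function names (`partition`, `toy_score`, `analyse_chunk`, `run`) are hypothetical, `toy_score` is a placeholder for a real logic-regression fit, and a thread pool stands in for the multi-core CPUs or cluster nodes of an actual run. The sketch assumes that each task seeds its random generator from the gene and the permutation round rather than from the worker, which is one simple way to obtain identical results for any number of workers.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_chunks):
    """Split items into n_chunks contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def toy_score(gene, perm_round):
    """Hypothetical stand-in for one logic-regression fit on one gene.

    The generator is seeded from the task itself (gene name and
    permutation round), never from the worker, so the score does not
    depend on how the work is partitioned.
    """
    rng = random.Random(f"{gene}:{perm_round}")
    return rng.random()


def analyse_chunk(genes, n_perm):
    """Run every permutation round for every gene in one chunk."""
    return {g: [toy_score(g, r) for r in range(n_perm)] for g in genes}


def run(genes, n_perm, n_workers):
    """Partition the genes, process the chunks concurrently, merge outputs."""
    chunks = partition(genes, n_workers)
    merged = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns the partial results in chunk order.
        for part in pool.map(analyse_chunk, chunks, [n_perm] * len(chunks)):
            merged.update(part)
    return merged


# Same problem size as the example in the abstract: 200 genes, 20 rounds.
genes = [f"gene{i}" for i in range(200)]
serial = run(genes, n_perm=20, n_workers=1)
parallel = run(genes, n_perm=20, n_workers=8)
```

In this sketch `serial == parallel` holds because no score depends on which worker computed it. For scale, the times reported in the abstract correspond to a speedup of about 7.3 / 0.9 ≈ 8.1 on eight CPUs, which is close to linear scaling given that the reported times are rounded.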

*Corresponding author; email: unitsa.s@psu.ac.th