Tuning Spline Smoothing Parameters in GWAS Using Replication-Based Approach

Authors

  • Chanunya Pailoung, Department of Applied Statistics, Faculty of Applied Science, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand
  • Pianpool Kamoljitprapa, Department of Applied Statistics, Faculty of Applied Science, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand
  • Sirikanlaya Sookkhee, Department of Mathematics, Faculty of Education, Sisaket Rajabhat University, Si Sa Ket, Thailand

Keywords:

GWAS, replication, sequence kernel association test, smoothing parameter, spline regression

Abstract

A Genome-Wide Association Study (GWAS) identifies associations between genetic variants, especially Single Nucleotide Polymorphisms (SNPs), and phenotypes such as disease risk. A GWAS can be conducted on single SNPs or on groups of SNPs. Analyzing GWAS data is challenging because of its high dimensionality, which inflates the type I error rate and imposes a heavy computational burden when many hypotheses are tested. To address these limitations, this research investigates the association between SNP sets, grouped by gene, and the risk of Crohn's disease. The Sequence Kernel Association Test (SKAT) is used to assess these associations, and spline regression is used to construct the model and reduce analytical complexity. The research has three aims: to obtain the optimal smoothing parameter of the spline regression model, specifically the degrees of freedom; to obtain the optimal number of replications for the simulated data; and to apply the resulting model to identify gene regions associated with Crohn's disease. The results indicate that 1,000 degrees of freedom is the optimal setting for the spline regression model, as it gives the lowest false positive rate while maintaining a reasonable true positive rate. In addition, 1,000 replications is identified as the optimal number, as it gives the most efficient processing time. The optimized model can therefore identify gene regions associated with Crohn’s disease while minimizing the error rate and conserving computational resources when analyzing large data sets.
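The selection rule described in the abstract (lowest false positive rate, subject to a reasonable true positive rate) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `error_rates` and `pick_df`, the significance level, the `min_tpr` threshold standing in for "reasonable", the candidate degrees of freedom, and the toy p-values, none of which come from the study.

```python
from typing import Dict, List, Tuple


def error_rates(p_values: List[float], is_causal: List[bool],
                alpha: float) -> Tuple[float, float]:
    """Return (false positive rate, true positive rate) at level alpha."""
    fp = sum(p < alpha and not c for p, c in zip(p_values, is_causal))
    tp = sum(p < alpha and c for p, c in zip(p_values, is_causal))
    n_null = sum(not c for c in is_causal)
    n_causal = sum(is_causal)
    return fp / n_null, tp / n_causal


def pick_df(results: Dict[int, List[float]], is_causal: List[bool],
            alpha: float = 0.05, min_tpr: float = 0.5):
    """Among candidates with TPR >= min_tpr, pick the lowest-FPR setting.

    Ties on FPR are broken in favour of the higher TPR.
    """
    rates = {df: error_rates(p, is_causal, alpha) for df, p in results.items()}
    eligible = {df: r for df, r in rates.items() if r[1] >= min_tpr}
    if not eligible:
        raise ValueError("no candidate reaches the required true positive rate")
    best = min(eligible, key=lambda df: (eligible[df][0], -eligible[df][1]))
    return best, rates


# Hypothetical toy data: six simulated gene regions, the first two causal.
is_causal = [True, True, False, False, False, False]
# Hypothetical gene-level p-values under three candidate degrees of freedom.
results = {
    4: [0.001, 0.020, 0.030, 0.200, 0.040, 0.600],
    8: [0.002, 0.030, 0.040, 0.300, 0.500, 0.700],
    16: [0.004, 0.080, 0.200, 0.400, 0.600, 0.900],
}

best_df, rates = pick_df(results, is_causal)
for df, (fpr, tpr) in sorted(rates.items()):
    print(f"df={df}: FPR={fpr:.2f}, TPR={tpr:.2f}")
print("selected df:", best_df)
```

In a simulation study the rates would typically be averaged over many replicates rather than computed from a single toy data set, and how strict to make the true positive rate threshold is a design choice the sketch leaves open.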

Published

2025-06-24

How to Cite

Pailoung, C., Kamoljitprapa, P., & Sookkhee, S. (2025). Tuning Spline Smoothing Parameters in GWAS Using Replication-Based Approach. Thailand Statistician, 23(3), 643–656. Retrieved from https://ph02.tci-thaijo.org/index.php/thaistat/article/view/259938