options(htmltools.dir.version = FALSE, width = 75, max.print = 30)
knitr::opts_knit$set(global.par = TRUE, root.dir = "..")
knitr::opts_chunk$set(echo = TRUE, fig.align = 'center', dev = 'png', out.width = "90%")

If you have individual-level data (i.e. genotypes and phenotypes), you can in principle use any supervised learning (machine learning) method to train a PGS. However, because genetic data are so large, most of these methods quickly run into scalability issues. Moreover, it has been shown that effects for most diseases and traits are small and essentially additive, and that more complex methods such as deep learning are not very effective at constructing PGS [@kelemen2025performance].

Therefore, penalized linear/logistic regression (PLR) is an efficient and effective way to train a PGS. My R package bigstatsr provides a very fast implementation that automatically chooses the two hyper-parameters [@prive2019efficient]. You can find a tutorial explaining its implementation and use here.
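To make this more concrete, here is a minimal sketch of fitting a penalized linear regression with `big_spLinReg()` from bigstatsr and predicting in held-out individuals. Everything about the data is an assumption made for illustration: the genotypes are simulated at random, the dimensions, number of causal variants and effect sizes are arbitrary, and the values passed to `alphas` and `K` are just one possible choice. The two hyper-parameters are presumably the elastic-net penalty strength (lambda) and mixing parameter (alpha); here alpha is searched over a small grid while lambda is chosen internally using validation folds.

```r
library(bigstatsr)

set.seed(1)
N <- 600   # individuals (hypothetical size, tiny compared to real data)
M <- 500   # variants (hypothetical size)

# Simulated genotype matrix coded as 0/1/2 allele counts
geno <- matrix(sample(0:2, N * M, replace = TRUE), nrow = N, ncol = M)
G <- as_FBM(geno)  # store it as a file-backed big matrix

# Simulated additive phenotype: 20 causal variants plus Gaussian noise
causal <- sample(M, 20)
beta <- rnorm(20)
y <- drop(geno[, causal] %*% beta) + rnorm(N)

# Split individuals into a training set and a test set
ind.train <- sort(sample(N, 400))
ind.test <- setdiff(seq_len(N), ind.train)

# Penalized linear regression; for a binary trait, big_spLogReg()
# takes a 0/1 outcome (y01.train) instead of y.train
mod <- big_spLinReg(G, y.train = y[ind.train], ind.train = ind.train,
                    alphas = c(1, 0.1), K = 5)

# Predict in the held-out individuals and check the correlation
pred <- predict(mod, G, ind.row = ind.test)
cor(pred, y[ind.test])
```

Calling `summary(mod)` afterwards should show which value of alpha was retained and how many variables have non-zero effects. With real genotype data, `G` would typically be the file-backed genotype matrix of a bigSNP object rather than a matrix simulated in memory.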

The following figure shows an example of using PLR to predict height from genotypes in the UK Biobank:

knitr::include_graphics("https://privefl.github.io/blog/images/UKB-final-pred.png")

References


