knitr::opts_chunk$set(echo = TRUE)
library(BEDMatrix)

(draft v3)


Purpose

Calculate plolygenic risk score (PRS) by using the BEDMatrix library to read Plink .bed files into R.

Example data

Example data, base & target, is the same used in the "Basic Tutorial for Polygenic Risk Analysis" by one of the authors of PRSice.


File Name | Data Type | Description ---------------------- | --------- | ------------------ Height.QC.Transformed | Base | The post-QCed and transformed summary statistics EUR.QC.bed | Target | The genotype file after performing some basic filtering EUR.QC.bim | Target | This file contains the SNPs that passed the basic filtering EUR.QC.fam | Target | This file contains the samples that passed the basic filtering


1) Read the base data file into R.

base_data <- read.table("/Volumes/ifs/Rprojects/PRS/PRS_tutorial/3_Calculating_PRS/Height.QC.Transformed", 
                        header = T
                      )
head(base_data)


2) Read the target data file into R using the BEDMatrix library.

target_data <- BEDMatrix("/Volumes/ifs/Rprojects/PRS/PRS_tutorial/2_Obtaining-target-data/EUR.QC.bed")

BEDMatrix objects are created in R by providing the path to a .bed file and once created, they behave similarly to regular matrices. Genotypes are retrieved on demand without loading the entire file into memory.


The dimensions of 'target_data' are:

dim(target_data)

To view a subset of the BEDMatrix object for the target data:

target_data[1:3, 1:5]

Important Note- The subsets extracted from a BEDMatrix object are coded similarly to .raw files (generated with the --recodeA argument in Plink): 0 indicates homozygous major allele, 1 indicates heterozygous, and 2 indicates homozygous minor allele.


3) Create a small subset of 'target_data' for this exercise:

target_data_subset <- target_data[1:50, 1:1000]
class(target_data_subset)

4) Rename columns in 'target_data_subset' so they match format

of rsIDs in 'base_data'

colnames(target_data_subset) = gsub(pattern = "_.", 
                                    replacement = "", 
                                    x = colnames(target_data_subset)
                                   )
target_data_subset[1:3, 1:5]

5) Transpose 'target_data_subset' so that the rows are rsIDs and columns are

sample IDs, same as 'base_data'

target_data_subset <- t(target_data_subset)
target_data_subset[1:5, 1:3]
head(base_data)

6) Subset 'base_data' with rsID's and OR (weight) only.

base_data_subset <- data.frame(SNP = base_data$SNP, OR = base_data$OR)
head(base_data_subset)
class(base_data_subset)

7) The list of variants in 'base_data_subset' and 'target_data_subset' need to match.

base_data_subset_match <- base_data_subset[base_data_subset$SNP %in% row.names(target_data_subset),]
target_data_subset_match <- target_data_subset[rownames(target_data_subset) %in% as.character(base_data_subset_match$SNP),]
head(base_data_subset_match, n = 10)
target_data_subset_match[1:10, 1:5]

The number of rows/variants in each should be the equal.

nrow(base_data_subset_match)
nrow(target_data_subset_match)

Now we have a matrix, 'target_data_subset_match', w/ SNP-score (0, 1 or 2), and

a data.frame, 'base_data_subset_match', with the weight (transformed OR),

that have the same list of variants.

Calculate PRS

8) Multiply the 'weight' by the SNP-score

df_product <- data.frame(apply(target_data_subset_match, 
                               2, 
                               function(x) x * base_data_subset_match$OR)
                              )
df_product[1:3, 1:5]

9) Calculate the PRS for each individual using the base R function 'colSums'

df_prs <- colSums(df_product)
head(df_prs, n = 5)


References

Marees AT, de Kluiver H, Stringer S, Vorspan F, Curis E, Marie‐Claire C, Derks EM. A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis. International journal of methods in psychiatric research. 2018 Jun;27(2):e1608.

Choi SW, Mak TS, O'reilly P. A guide to performing Polygenic Risk Score analyses. BioRxiv. 2018 Jan 1:416545.

Grueneberg, A., and G. de los Campos, 2019 BGData - A Suite of R Packages for Genomic Analysis with Big Data. G3: Genes, Genomes, Genetics 9(5): 1377-1383.

"Basic Tutorial for Polygenic Risk Score Analysis" https://choishingwan.github.io/PRS-Tutorial/target/

https://nicercode.github.io/guides/repeating-things/



timyers/ImPRS documentation built on Oct. 6, 2020, 6:26 p.m.