knitr::opts_chunk$set(echo = TRUE) library(BEDMatrix)
Calculate plolygenic risk score (PRS) by using the BEDMatrix library to read Plink .bed files into R.
Example data, base & target, is the same used in the "Basic Tutorial for Polygenic Risk Analysis" by one of the authors of PRSice.
A modified version of the Height GWAS summary statistics generated by the GIANT consortium, with QC, ambiguous SNPs removed, etc.
The Target data
File Name | Data Type | Description ---------------------- | --------- | ------------------ Height.QC.Transformed | Base | The post-QCed and transformed summary statistics EUR.QC.bed | Target | The genotype file after performing some basic filtering EUR.QC.bim | Target | This file contains the SNPs that passed the basic filtering EUR.QC.fam | Target | This file contains the samples that passed the basic filtering
base_data <- read.table("/Volumes/ifs/Rprojects/PRS/PRS_tutorial/3_Calculating_PRS/Height.QC.Transformed", header = T ) head(base_data)
target_data <- BEDMatrix("/Volumes/ifs/Rprojects/PRS/PRS_tutorial/2_Obtaining-target-data/EUR.QC.bed")
BEDMatrix objects are created in R by providing the path to a .bed file and once created, they behave similarly to regular matrices. Genotypes are retrieved on demand without loading the entire file into memory.
dim(target_data)
target_data[1:3, 1:5]
Important Note- The subsets extracted from a BEDMatrix object are coded similarly to .raw files (generated with the --recodeA argument in Plink): 0 indicates homozygous major allele, 1 indicates heterozygous, and 2 indicates homozygous minor allele.
target_data_subset <- target_data[1:50, 1:1000] class(target_data_subset)
colnames(target_data_subset) = gsub(pattern = "_.", replacement = "", x = colnames(target_data_subset) ) target_data_subset[1:3, 1:5]
target_data_subset <- t(target_data_subset) target_data_subset[1:5, 1:3] head(base_data)
base_data_subset <- data.frame(SNP = base_data$SNP, OR = base_data$OR) head(base_data_subset) class(base_data_subset)
base_data_subset_match <- base_data_subset[base_data_subset$SNP %in% row.names(target_data_subset),]
target_data_subset_match <- target_data_subset[rownames(target_data_subset) %in% as.character(base_data_subset_match$SNP),]
head(base_data_subset_match, n = 10) target_data_subset_match[1:10, 1:5]
nrow(base_data_subset_match) nrow(target_data_subset_match)
df_product <- data.frame(apply(target_data_subset_match, 2, function(x) x * base_data_subset_match$OR) ) df_product[1:3, 1:5]
df_prs <- colSums(df_product) head(df_prs, n = 5)
Marees AT, de Kluiver H, Stringer S, Vorspan F, Curis E, Marie‐Claire C, Derks EM. A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis. International journal of methods in psychiatric research. 2018 Jun;27(2):e1608.
Choi SW, Mak TS, O'reilly P. A guide to performing Polygenic Risk Score analyses. BioRxiv. 2018 Jan 1:416545.
Grueneberg, A., and G. de los Campos, 2019 BGData - A Suite of R Packages for Genomic Analysis with Big Data. G3: Genes, Genomes, Genetics 9(5): 1377-1383.
"Basic Tutorial for Polygenic Risk Score Analysis" https://choishingwan.github.io/PRS-Tutorial/target/
https://nicercode.github.io/guides/repeating-things/
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.