Genetic Association Analysis

set.seed(0)
knitr::opts_chunk$set(
  out.extra = 'style="display:block; margin: auto"',
  fig.align = "center",
  fig.height = 8,
  fig.path = "gaawr2/",
  fig.width = 8,
  collapse = TRUE,
  comment = "#>",
  dev = "CairoPNG")
pkgs <- c("EnsDb.Hsapiens.v75","ensembldb","GMMAT","HardyWeinberg","MCMCglmm","SNPassoc","biomaRt",
           "gap","gap.datasets","haplo.stats","powerEQTL","R2jags","regress",
           "dplyr","ggplot2","httr","jsonlite","kableExtra","knitr","tidyr")
for (p in pkgs) if (length(grep(paste("^package:", p, "$", sep=""), search())) == 0) {
    if (!requireNamespace(p)) warning(paste0("This vignette needs package `", p, "'; please install"))
}
invisible(suppressMessages(lapply(pkgs, require, character.only = TRUE)))
sys_options <- options()
new_options <- options(digits=2)

The package gathers information, meta-data and scripts in a two-part Henry-Stewart talk @zhao09, which showcases analysis in aspects such as testing of polymorphic variant(s) for Hardy-Weinberg equilibrium, association with trait using genetic and statistical models as well as Bayesian implementation, power calculation in study design and genetic annotation. It also covers R integration with the Linux environment, GitHub, package creation and web applications.

It is adapted from pQTLdata, https://jinghuazhao.github.io/pQTLdata/.

Hello, world!

We start with several ways of printting a Hello, world! message.

R

R can be started from either command line interface (CLI) or a graphical user interface (GUI),

print("Hello, world!\n")

Linux

As it is very powerful, we more often embed R in a Linux script as follows,

```{bash linux} export message="Hello, world!" echo "print('$message')" > hello.R R CMD BATCH hello.R R --no-save -q < hello.R R --no-save -q <<END message <- Sys.getenv("message"); print(message) source("hello.R") END echo ${message} | \ Rscript -e ' message <- scan("stdin", what="", sep="\n", quiet=TRUE); write.table(message, col.names=FALSE, row.names=FALSE, quote=FALSE) ' | \ cat rm hello.*

where the backslash (`\`) is for line continuation.

As shown in the example, one can take advantage of powerful data handling facilities in the Linux
environment, through either Linux itself or software followed by their counterparts in R with
options to feed back to the Linux envornment again for further use.

Moreover, R could be an integrated component of a workflow, e.g., as curated in **pQTLtools**
involving **snakemake**, <https://jinghuazhao.github.io/pQTLtools/articles/snakemake.html>.

# Language elements

Basic data manipulation of the `iris` data includes

```r
class(iris)
dim(iris)
str(iris)
head(iris,1)
tail(iris,1)

We would like to highlight two types of operators,

options(new_options)
data(diabetes,package="gaawr2")

mean_values <- diabetes %>%
  dplyr::filter(CLASS %in% c("Y", "N", "P")) %>%
  dplyr::mutate(
    Gender = dplyr::recode(Gender, "F" = "Female", "M" = "Male"),
    CLASS = dplyr::recode(CLASS, "Y" = "Yes", "N" = "No", "P" = "Predicted")
  ) %>%
  dplyr::group_by(CLASS, Gender) %>%
  dplyr::select(AGE:BMI) %>%
  dplyr::summarize(dplyr::across(dplyr::everything(), \(x) mean(x, na.rm = TRUE)))
kableExtra::kbl(mean_values,caption="Mean value by gender and diabetes category") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

mean_values_long <- mean_values %>%
  tidyr::pivot_longer(
    cols = AGE:BMI,
    names_to = "Variable",
    values_to = "Mean_Value"
  )
ggplot2::ggplot(mean_values_long,
  ggplot2::aes(x = Variable, y = Mean_Value, fill = CLASS)) +
  ggplot2::geom_col(position = ggplot2::position_dodge()) +
  ggplot2::facet_wrap(~ Gender) +
  ggplot2::labs(
    title = "",
    x = "Variable",
    y = "Mean Value",
    fill = "Diabetes Status" # Modified legend title for clarity
  ) +
  ggplot2::theme_bw() +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))
options(sys_options)

We see higher mean age and HbA1c values in the "Yes" diabetes group as well as noticeable differences in BMI, which align with the known associations between these variables and diabetes.

The Comprehensive R Archive Network (CRAN) host and Bioconductor host many fined-tuned user-contributed packages, their installation is furnished through

Data analysis

Topics in this section underpins large-scale genome data analysis such as Genomewide association study (GWAS) and vary from those classic models such as mets for twin data to heavily featured in candidate gene studies, such as Hardy-Weinberg equilirium (HWE), to GWAS such as various types of association statistics, QQ/Manhattan/local association plots.

There has been a lot of interests in machine learning (ML), artificial language (AI), including deep learning, just to add one more acronym, the bulk of which is also readily available.

HardyWeinberg

We set to run through the package for HWE. Three data sources are used: MN blood group in the documentation, a chromosome X SNP and a HLA/DQR,

# MN blood group
SNP <- c(MM = 298, MN = 489, NN = 213)
HardyWeinberg::maf(SNP)
HardyWeinberg::HWTernaryPlot(SNP,region=0,grid=TRUE,markercol="blue")
HardyWeinberg::HWChisq(SNP, cc = 0, verbose = TRUE)
# Chromosome X
xSNP <- c(A=10, B=20, AA=30, AB=20, BB=10)
HardyWeinberg::HWChisq(xSNP,cc=0,x.linked=TRUE,verbose=TRUE)
# HLA/DQR
DQR <- gap.datasets::hla[,3:4]
a1 <- DQR[1]
a2 <- DQR[2]
GenotypeCounts <- HardyWeinberg::AllelesToTriangular(a1,a2)
kableExtra::kbl(GenotypeCounts,caption="Genotype distribution of DQR") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
HardyWeinberg::HWPerm.mult(GenotypeCounts,nperm=300)
HardyWeinberg::HWStr(hla[,3:4],test="permutation",nperm=300)

The MN locus is seen to be close to the HWE line from the ternary plot. Only 300 permutations are done for the HLA/DQR data to illustrate.

SNPassoc

The package implements procedures which are appropriate for candidate gene association analysis, under a variety of genetic models.

We first look at some meta-data, include HWE.

data(asthma, package = "SNPassoc")
str(asthma, list.len=8)
knitr::kable(asthma[1:3,1:8],caption="First three records & two SNPs")
snpCols <- colnames(asthma)[6+(1:2)]
snps <- SNPassoc::setupSNP(data=asthma[snpCols], colSNPs=1:length(snpCols), sep="")
head(snps)
summary(snps, print=FALSE)
lapply(snps, head)
lapply(snps, summary)
SNPassoc::tableHWE(snps)

where variable snpCols skips six columns of non-SNP data for two SNPs.

We then turn to genetic models for the first one,

asthma.snps <- asthma %>%
               dplyr::rename(cc=casecontrol) %>%
               SNPassoc::setupSNP(colSNPs=(6+1):ncol(.), sep="")
# Model 1: Simple SNP association with BMI
SNPassoc::association(bmi ~ rs4490198, data = asthma.snps)

# Model 2: SNP association with case-control status
SNPassoc::association(cc ~ rs4490198, data = asthma.snps)

# Model 3: SNP association with covariates (country and smoke)
SNPassoc::association(cc ~ rs4490198 + country + smoke, data = asthma.snps)

# Model 4: SNP association with stratification by gender
SNPassoc::association(cc ~ rs4490198 + survival::strata(gender), data = asthma.snps)

# Model 5: SNP association with subset (only Spain)
SNPassoc::association(cc ~ rs4490198, data = asthma.snps, subset = country == "Spain")

# Model 6: Interaction between SNP (dominant model) and smoking
SNPassoc::association(cc ~ SNPassoc::dominant(rs4490198) * factor(smoke), data = asthma.snps)

# Model 7: Interaction between two SNPs (dominant model for rs4490198)
SNPassoc::association(cc ~ rs4490198 * factor(rs11123242), data = asthma.snps, model.interaction = "dominant")

haplo.stats

This package considers haplotype estimation using EM-algorithms and genetic association under a generalized linear model (GLM).

# Association with the first three SNPs
snpsH <- names(asthma.snps)[6+(1:3)]
genoH <- SNPassoc::make.geno(asthma.snps, snpsH)
em <- haplo.stats::haplo.em(genoH, locus.label = snpsH, miss.val = c(0, NA))
haplo_table <- with(em,cbind(haplotype,hap.prob))
knitr::kable(haplo_table,caption="Haplotypes of the first three SNPs")
modH <- haplo.stats::haplo.glm(cc ~ genoH, data=asthma.snps,
                               family="binomial",
                               locus.label=snpsH,
                               allele.lev=attributes(genoH)$unique.alleles,
                               control = haplo.stats::haplo.glm.control(haplo.freq.min=0.05))
modH
SNPassoc::intervals(modH)

# Model comparison with / without haplotypes
mod.adj.ref <- glm(cc ~ smoke, data=asthma.snps, family="binomial")
mod.adj <- haplo.glm(cc ~ genoH + smoke, data=asthma.snps,
                 family="binomial",
                 locus.label=snpsH,
                 allele.lev=attributes(genoH3)$unique.alleles,
                 control = haplo.stats::haplo.glm.control(haplo.freq.min=0.05))
mod.adj
lrt.adj <- mod.adj.ref$deviance - mod.adj$deviance
pchisq(lrt.adj, mod.adj$lrt$df, lower=FALSE)

# Four variable slide windows over nine SNPs
snpsH <- labels(asthma.snps)[6+(1:9)]
genoH <- SNPassoc::make.geno(asthma.snps, snpsH)
haploH <- list()
for (i in 1:4) haploH[[i]] <- haplo.stats::haplo.score.slide(asthma.snps$cc, genoH,
                              trait.type="binomial",
                              n.slide=i,
                              locus.label=snpsH,
                              simulate=TRUE,
                              sim.control=haplo.stats::score.sim.control(min.sim=50,max.sim=100))

GWAS

We return to the asthma data used in SNPassoc.

assoc <- SNPassoc::WGassociation(cc, data=asthma.snps)
assoc.adj <- SNPassoc::WGassociation(cc ~ country + smoke, asthma.snps)
assoc.maxstat <- SNPassoc::maxstat(asthma.snps, cc)
assoc %>%
  as.data.frame() %>%
  dplyr::select(-comments) %>%
  knitr::kable(caption="SNP association")
assoc.adj %>%
  as.data.frame() %>%
  dplyr::select(-comments) %>%
  knitr::kable(caption="with adjustment for contountry & smoking")
assoc.maxstat %>%
  `[`(,) %>%
  t() %>%
  knitr::kable(caption = "Max stat association statistics")

where assoc.maxstat is coerced into a matrix later, but there appears problematic to knit though.

GMMAT

The following is modified slightly from the package vignette,

data(example,package="GMMAT")
attach(example)
model0 <- GMMAT::glmmkin(disease ~ age + sex, data = pheno, kins = GRM,
                         id = "id", family = binomial(link = "logit"))
model1 <- GMMAT::glmmkin(fixed = trait ~ age + sex, data = pheno, kins = GRM,
                         id = "id", family = gaussian(link = "identity"))
model2 <- GMMAT::glmmkin(fixed = trait ~ age + sex, data = pheno, kins = GRM,
                         id = "id", groups = "disease",
                         family = gaussian(link = "identity"))
snps <- c("SNP10", "SNP25", "SNP1", "SNP0")
geno.file <- system.file("extdata", "geno.bgen", package = "GMMAT")
samplefile <- system.file("extdata", "geno.sample", package = "GMMAT")
outfile <- "glmm.score.txt"
GMMAT::glmm.score(model0, infile = geno.file, BGEN.samplefile = samplefile,
                  outfile = outfile)
read.delim(outfile) |>
     head(n=4) |>
     knitr::kable(caption="Score tests under GLMM on four SNPs",digits=2)
unlink(outfile)
bed.file <- system.file("extdata", "geno.bed", package = "GMMAT") |>
            tools::file_path_sans_ext()
model.wald <- GMMAT::glmm.wald(fixed = disease ~ age + sex, data = pheno,
                               kins = GRM, id = "id", family = binomial(link = "logit"),
                               infile = bed.file, snps = snps)
knitr::kable(model.wald,caption="Wald tests under GLMM on four SNPs")
detach(example)

where both BGEN and PLINK binary file are illustrated.

h2.jags

The function uses JAGS (https://mcmc-jags.sourceforge.io/) for heritability estimation @zhao18,

set.seed(1234567)
meyer <- within(gap.datasets::meyer,{
         y[is.na(y)] <- rnorm(length(y[is.na(y)]),mean(y,na.rm=TRUE),sd(y,na.rm=TRUE))
         g1 <- ifelse(generation==1,1,0)
         g2 <- ifelse(generation==2,1,0)
         id <- animal
         animal <- ifelse(!is.na(animal),animal,0)
         dam <- ifelse(!is.na(dam),dam,0)
         sire <- ifelse(!is.na(sire),sire,0)
     })
G <- gap::kin.morgan(meyer)$kin.matrix*2
r <- regress::regress(y~-1+g1+g2,~G,data=meyer)
r
with(r,gap::h2G(sigma,sigma.cov))
eps <- 0.001
y <- with(meyer,y)
x <- with(meyer,cbind(g1,g2))
ex <- gap::h2.jags(y,x,G,sigma.p=0.03,sigma.r=0.014,n.chains=1,n.iter=80)
kableExtra::kbl(ex$BUGSoutput$summary,digits=2,caption="MCMC results for the Meyer data") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

To avoid multithread and excessive time for CRAN checking, only one chain and 80 iterations are run, 40 of which are burn-ins and every iteraction is kept (n.thin=1).

powerEQTL

Consider powereQTL.SLR (simple linear regression) for a sample size of 50-300 by 50, minor allele frequencies 0.005~0.5, $\alpha$=0.05. We have,

n.designs <- 6
designs <- 1:n.designs
N <- 50 * designs
n.grids <- 100
index <- 1:n.grids
grids <- index / n.grids
MAF <- seq(0.005, n.grids/2, by=0.5) / n.grids
plot(MAF,grids,type="n",ylab="Power")
mtext(expression(paste("(",alpha," = 0.05)")),1,line=4.5)
colors <- grDevices::hcl.colors(n.designs)
for (design in designs)
{
  power.SLR <- rep(NA,n.grids)
  for (j in index) power.SLR[j] <- powerEQTL::powerEQTL.SLR(MAF = MAF[j], FWER = 0.05, nTests = 240, slope = 0.13,
                                                            n = N[design], sigma.y = 0.13)
  lines(MAF,power.SLR,col=colors[design])
}
legend("bottomright", inset=.02, title="Sample size (N)", paste(N), col=colors, horiz=FALSE, cex=0.8, lty=designs)

The counterpart for single-cell RNA-Seq design is via powerEQTL.scRNAseq.

Annotations

EnsDb.Hsapiens.v75

ensembldb::metadata(EnsDb.Hsapiens.v75)
genes <- ensembldb::genes(EnsDb.Hsapiens.v75)
head(genes)
transcripts_data <- ensembldb::transcripts(EnsDb.Hsapiens.v75)
head(transcripts_data)

One can also use exons_data <- ensembldb::exons(EnsDb.Hsapiens.v75);head(exons_data) but it is skipped for being considerably longer.

biomaRt

if (!biomaRt::martBMCheck(mart)) {
  stop("The BioMart service is currently unavailable.")
}
biomaRt::listMarts()
ensembl <- biomaRt::useMart("ensembl", dataset="hsapiens_gene_ensembl", host="grch37.ensembl.org", path="/biomart/martservice")
biomaRt::listDatasets(ensembl)
attr <- biomaRt::listAttributes(ensembl)
attr_select <- c('ensembl_gene_id', 'chromosome_name', 'start_position', 'end_position', 'description', 'hgnc_symbol',
                 'transcription_start_site')
gene <- biomaRt::getBM(attributes = attr_select, mart = ensembl)
filter <- biomaRt::listFilters(ensembl)
biomaRt::searchFilters(mart = ensembl, pattern = "gene")
# GRCh38
ensembl <- biomaRt::useMart("ensembl", dataset="hsapiens_gene_ensembl")

Experimental Factor Ontology (EFO)

The ontology of traits/disease is formally available as this @malone10. The script below assumes that efo-3.26.0 has been downloaded.

library(ontologyIndex)
id <- function(ontology)
{
  inflammatory <- grep(ontology$name,pattern="inflammatory")
  immune <- grep(ontology$name,pattern="immune")
  inf <- union(inflammatory,immune)
  list(id=ontology$id[inf],name=ontology$name[inf])
}
# GO
data(go)
goidname <- id(go)
# EFO
file <- "efo.obo" # efo-3.26.0
get_relation_names(file)
efo <- get_ontology(file, extract_tags="everything")
length(efo) # 89
length(efo$id) # 27962
efoidname <- id(efo)
diseases <- get_descendants(efo,"EFO:0000408")
efo_0000540 <- get_descendants(efo,"EFO:0000540")
efo_0000540name <- efo$name[efo_0000540]
isd <- data.frame(efo_0000540,efo_0000540name)
library(ontologyPlot)
onto_plot(efo,efo_0000540)

OpenTargets

gene_id <- "ENSG00000164308"
query_string = "
  query target($ensemblId: String!){
    target(ensemblId: $ensemblId){
      id
      approvedSymbol
      biotype
      geneticConstraint {
        constraintType
        exp
        obs
        score
        oe
        oeLower
        oeUpper
      }
      tractability {
        label
        modality
        value
      }
    }
  }
"
base_url <- "https://api.platform.opentargets.org/api/v4/graphql"
variables <- list("ensemblId" = gene_id)
post_body <- list(query = query_string, variables = variables)
r <- httr::POST(url=base_url, body=post_body, encode='json')
data <- iconv(r, "", "ASCII")
content <- jsonlite::fromJSON(data)
target <- content$data$target
scalar_fields <- data.frame(
  Field = c("ID", "Approved Symbol", "Biotype"),
  Value = c(target$id, target$approvedSymbol, target$biotype)
)
tractability_data <- target$tractability
kableExtra::kbl(scalar_fields,caption="(a) Basic Information") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
kableExtra::kbl(target$geneticConstraint, caption="(b) Genetic Constraint Metrics") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
kableExtra::kbl(tractability_data,caption="(c) Tractability Information") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

where jsonlite::fromJSON(content(r,"text",encoding="UTF-8")) also works when R is nicely compiled with libiconv.

Additional packages

Packages gwasrapidd provides easy access to the GWAS Catalog, while rentrez enables search for GenBank and PubMed.

An overview on proteogenomics is available @suhre20. Some aspects of the analysis is given by pQTLtools, https://jinghuazhao.github.io/pQTLtools/.

gaawr2

While created as a showcase of modern package development, like other R packages it includes data examples, customized functions, documentation and featured articles. The workflow is shown in the following diagram.

graph TB; A[Package creation] --> B[GitHub respository]; B --> C[Pkgdown styling];C --> D[Refinement];D --> E[Testing]

The relevant scripts are with inst/scripts directory in the source package. Briefly,

Note that for creation of the GitHub repository, an SSH key is assumed in place. In order for pkgdown.sh to function well, all required files such as nature-genetics.csl need to be available.

Moreover, the devtools::document() in pkgdown.sh automatically updates NAMESPACE and regenerates documentation files (.Rd), which can be picked up through pkgdown::build_reference(). The refinement is greatly facilitated by GitHub R-CMD-check.yaml workflow, namely, https://github.com/jinghuazhao/gaawr2/actions/workflows/R-CMD-check.yaml, e.g., flagging missing packages in package building.

A GitHub login is still necessary to enable web pages, so that this can be accessed as https://jinghuazhao.github.io/gaawr2/. Upon use of pkgdown, an article can be seen from the menu item Articles.

We carry on adding files such as NEWS.md and _pkgdown.yml, involing MathJax and mermaid:

math-rendering: mathjax
includes:
  after_body: <script type="module">import mermaid from "https://unpkg.com/mermaid@10.4.0/dist/mermaid.esm.min.mjs";mermaid.initialize({});</script>

In line with various analyses we have covered, their associate packages are also added to the suggested list of packages in DESCRIPTION:

suggests <- read.dcf(file = system.file("DESCRIPTION", package = "gaawr2"), fields = c("Suggests"))
write.dcf(suggests)

Summary

Following part I of the talk, we have further explored various aspects of genetic association analysis in R, particularly in the context of computing systems like Linux. While these serve as a solid foundation, a more in-depth data analysis coupled with more rigorous development is surely fruitful and rewarding.

References



Try the gaawr2 package in your browser

Any scripts or data that you put into this service are public.

gaawr2 documentation built on April 4, 2025, 2:25 a.m.