In RHReynolds/LDSCforRyten: Run LDSC on Ryten Lab Server

library(corrplot)
library(devtools)
library(DT)
library(here)
library(ggplot2)
library(GeneOverlap)
library(LDSCforRyten)
library(tidyverse)
library(readxl)
library(UpSetR)
devtools::load_all(path = "/home/rreynolds/projects/MarkerGenes/")

predictions <- read_csv(file = "~/misc_projects/JuanBotia_G2PML/data/20190718_allpredictions.csv") %>% 
  dplyr::mutate(panel = str_replace(panel, ".rds", ""),
                # Rename 
                panel = str_replace(panel, "Early_onset_dementia_\\(encompassing_fronto-temporal_dementia_and_prion_disease\\)", "Early_onset_dementia_with_FTD_and_prion"),
                panel = str_replace_all(panel, "_", ".")) %>% 
  dplyr::select(-X1)

knitr::opts_chunk$set(echo = T, message= F, warning = F)

Background

s-LDSC (https://www.ncbi.nlm.nih.gov/pubmed/26414678) is a method that allows you to determine the relative contribution of an annotation (in this case, our modules of interest) to disease heritability. We use the co-efficient p-value as our read-out for significance, as it tells us whether our annotation is significantly contributing to disease heritability after we have accounted for underlying genetic architecture (as represented by a 97-annotation baseline model, which tags coding regions, enhancer regions, histones, promoters, LD structures and MAF etc.). See supplementary table 1 in https://www.ncbi.nlm.nih.gov/pubmed/26414678.
Experimental details: Used gene predictions from Juan Botia's G2P biorxiv paper (https://www.biorxiv.org/content/10.1101/288845v1). Genes were predicted using gene panels defined by Genomics England together with 154 gene-specific features.
Aim: Use stratified LDSC to determine whether predicted genes based are enriched for disease heritability.

Running LDSC

Restricting gene lists

Need to find the appropriate balance between prediction quality and the number of genes within each panel to run with LDSC.
If using cut-off of quality >= 1, the number of genes per panel is:

predictions %>% 
  dplyr::filter(quality >= 1) %>% 
  dplyr::group_by(panel) %>% 
  summarise(n = n())

However, using cut-off of quality >= 0.8, the number of genes per panel is:

predictions %>% 
  dplyr::filter(quality >= 0.8) %>% 
  dplyr::group_by(panel) %>% 
  summarise(n = n())

Lowering the prediction quality cut-off does increase the number of genes for some of the smaller panels e.g. Parkinson Disease and Complex Parkinsonism. However, it also vastly increases the number of genes in some of the other panels. This is likely to affect LDSC performance, as well, as panels become too non-specific.
For this reason, will run with prediction >= 1.

Creating gene lists

filtered_predictions <- 
  # Create dataframe containing panels and gene predictions with quality >= 1
  predictions %>% 
  dplyr::filter(quality >= 0.8) %>% 
  dplyr::mutate(annot = "prediction.80") %>% 
  # Bind additional dataframe containing panels and gene predictions with quality >= 0.8
  bind_rows(predictions %>% 
              dplyr::filter(quality >= 1) %>% 
              dplyr::mutate(annot = "prediction.100"))

Overlap between predicted genes across panels

Prediction quality >= 1

If we want to interpret later results of LDSC, it would be worthwhile knowing the overlap between various gene panel predictions. Using only those genes with a predicted quality of 1, the overlap in real numbers is displayed below.

# Create lists of genes to perform overlap
gene_names_list <- setNames(filtered_predictions %>% 
                              dplyr::filter(annot == "prediction.100") %>% 
                              group_split(panel),
                            filtered_predictions %>% 
                              dplyr::filter(annot == "prediction.100") %>% 
                              .[["panel"]] %>% 
                              unique() %>% 
                              sort()) %>% 
  lapply(., function(x) x %>% .[["gene"]])

upset(fromList(gene_names_list), sets = names(gene_names_list), keep.order = TRUE, nintersects = 30, order.by = "freq")

As the two biggest gene lists, the intellectual disability and genetic epilepsy syndromes prediction panels appear to have the most overlaps with various other panels.
While knowing the actual number of genes overlapping between lists is useful, it is also good to have a relative measure of this. For this we can use the Jaccard index - defined as the intersection between two sets divided by the union of those two sets. In other words, you take into account the varying sizes of different modules and get a read out of the overlap between the two.
In the figure below, I've highlighted any overlaps with Jaccard > 0.05 i.e. greater than 5% overlap.

gom.self <- newGOM(gene_names_list, gene_names_list, genome.size = 21196)
All.jaccard <- getMatrix(gom.self, "Jaccard")
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(All.jaccard, method="color", col=col(100), cl.lim = c(0,1), 
          type="lower", 
          # addCoef.col = "black", number.cex = 0.75, # Add coefficient of correlation
          tl.col="black", tl.srt=45, tl.cex = 0.5, mar=c(0,0,1,0), #Text label color and rotation
          p.mat = All.jaccard, sig.level = (0.05), insig = "p-value", number.cex = 0.5,  
          diag = TRUE)

Looking at relative degrees of overlap, the following can be observed:
- Largest overlap is found between intellectual disability and genetic epilepsy syndromes, with a 30% overlap.
- Second largest is between distal and congentical myopathy panels -- based purely on naming conventions, this is perhaps not unexpected.
- Third largest overlap is between cerebral vascular malformations and early onset dementia panels. Biologically, there is thought to be a link between vascular dysregulation and late-onset AD (https://www.nature.com/articles/ncomms11934); not sure what evidence there is for an association between vascular dysregulation and early onset dementias.
- The congenital myopathy panel has a >= 5% overlap with all but one panel (congenital muscular dystrophy).
- The congenital muscular dystrophy panel has no overlaps with other panels that are >= 5%. With 106 genes, it is not the smallest panel so it is not necessarily due to it's smaller size.

Creating LDSC annotations

Annotations can be created using a number of functions available in the LDSCforRyten package.
In addition to creating annotations with SNPs found within start and end site for transcription, we also include SNPs found within 100kb upstream and downstream of these sites (as in: https://www.ncbi.nlm.nih.gov/pubmed/29632380)
Given that we are not looking for critical cell types/tissues, then according to developer guidelines on baseline model (https://data.broadinstitute.org/alkesgroup/LDSCORE/readme_baseline_versions) we should use baseline LD v2.2 (97 annotations). This was published here: https://www.cell.com/ajhg/supplemental/S0002-9297(19)30053-9. And can be downloaded here: https://data.broadinstitute.org/alkesgroup/LDSCORE/
As we are using positional mapping to generate annotations, make sure the appropriate genome build is selected when extracting gene positions. Bear in mind that baseline v1.2 is based on GRCh38, while all the remaining baselines are based on GRCh37.

# The following snippet only needs to be run once.
# Extract positional information and add 100kb window around genes within dataframe
filtered_predictions <- filtered_predictions %>% 
  LDSCforRyten::AddGenePosDetails_RemoveXandYandMT(.,
                                     columnToFilter = "gene", 
                                     mart = 37, 
                                     attributes = c("hgnc_symbol","chromosome_name", 
                                                             "start_position","end_position"),
                                     filter = c("hgnc_symbol")) %>% 
  LDSCforRyten::AddBPWindow(windowsize = 100000) %>% 
  # Add column combining panel and prediction quality into one name, to allow splitting of dataframe into lists
  dplyr::mutate(combined = str_c(annot, panel, sep = ":")) %>% 
  dplyr::select(combined, everything())


# Split dataframe by panel/prediction quality into individual dataframes and store within a named list
# NOTE: Names assigned to each individual dataframe will be used to output files to a directory, and as the file name. It is VERY important that these names contain no '_', as this will downstream use of Assimilate_H2_results() function, which will remove anything prior to the last _ in the annotation name.  
gene_list <- setNames(filtered_predictions %>% 
                           group_split(combined),
                         filtered_predictions %>% 
                           .[["combined"]] %>% 
                           unique() %>% 
                           sort())

# Loading baseline and creating Granges object from baseline model
BM <- LDSCforRyten::creating_baseline_df(baseline_model = "97")
BM_GR <- LDSCforRyten::df2GRanges(BM, 
                                  seqname_col = "CHR", 
                                  start_col = "BP",
                                  end_col = "BP")

# Find overlapping regions between genes in each annotation and BM (baseline model)
hits <- gene_list %>%
  LDSCforRyten::overlap_annot_list(BM_GR, 
                                   seqname_col = "chromosome_name", 
                                   start_col = "Gene.Start.MinusKB", 
                                   end_col = "Gene.end.PlusKB")

# Annotate BM with 1s where annotation overlaps with the baseline model
list.BM <- hits %>%
  overlap_annot_hits_w_baseline(BM)

# Exporting files
annot_basedir <- "/home/rreynolds/data/Extra_Annotations/Juan2019.ML/"
LDSCforRyten::create_annot_file_and_export(list.BM, annot_basedir = annot_basedir)

Running LDSC

Ran with the following GWASs:

GWASs <- as.character("AD2018,ALS2018,Anxiety2016.cc,Anxiety2016.fs,ILAE.Epilepsy2018.All,ILAE.Epilepsy2018.Focal,ILAE.Epilepsy2018.Generalised,Height2018,Intelligence2018,MDD2018_ex23andMe,MS,OCD2017,PD2018_meta5_ex23andMe,RA2014.EUR,SCZ2018,WHR") %>% 
  str_split(",") %>% 
  unlist()

read_excel(path = here::here("misc", "LDSC_GWAS_details.xlsx")) %>% 
  dplyr::filter(Original_name %in% GWASs) %>% 
  dplyr::select(-contains("Full_paths"), -Original_name, -Notes) %>% 
  DT::datatable(rownames = FALSE,
                options = list(scrollX = TRUE),
                class = 'white-space: nowrap')

S-LDSC can be run from the command line using the LDSC_Pipeline_Entire.R script. This script has multiple optional flags, including:
- --GWAS: If called, this flag allows the user to select which GWAS to run with. If this option is not called, then the script will default to running all available GWASs. Note that running with all GWASs will slow the process.
- --baseline_model: Use this flag to specify which baseline model should be used to estimate heritability. By default, the 97 annotation will be used, which includes annotations relating to genetic architecture, LD, MAF, conservation, etc.
- --calculate_LD_alone: If user only wants to calculate LD scores for an annotation, argument 'TRUE' must be supplied.
- --calculate_h2_alone: If user only wants to calculate heritability estimates, argument 'TRUE' must be supplied.
- For more details, please refer to the -h flag.
For this tutorial, the following command was run:

```{bash LDSC arguments, eval = F, tidy = T} cd /home/rreynolds/data/Extra_Annotations/Juan2019.ML/

Prediction quality >= 1

nohup Rscript \ /home/rreynolds/packages/LDSCforRyten/pipelines/LDSC_Pipeline_Entire.R \ /home/rreynolds/data/Extra_Annotations/ \ Juan2019.ML \ prediction.100:Cerebral.vascular.malformations,prediction.100:Congenital.muscular.dystrophy,prediction.100:Congenital.myopathy,prediction.100:Distal.myopathies,prediction.100:Early.onset.dementia.with.FTD.and.prion,prediction.100:Early.onset.dystonia,prediction.100:Genetic.epilepsy.syndromes,prediction.100:Intellectual.disability,prediction.100:Parkinson.Disease.and.Complex.Parkinsonism,prediction.100:Skeletal.Muscle.Channelopathies \ --GWAS=AD2018,ALS2018,Anxiety2016.cc,Anxiety2016.fs,ILAE.Epilepsy2018.All,ILAE.Epilepsy2018.Focal,ILAE.Epilepsy2018.Generalised,Height2018,Intelligence2018,MDD2018_ex23andMe,MS,OCD2017,PD2018_meta5_ex23andMe,RA2014.EUR,SCZ2018,WHR \ --baseline_model=97 \ &>/home/rreynolds/misc_projects/JuanBotia_G2PML/nohup_logs/20190731_LDSC_Prediction_100.log&

# Results
- A number of functions are availble to help collect the output of LDSC. In the sample code below, only `Assimilate_H2_results()` was used. There are also, however, functions for plotting the results of these (not used in this tutorial). For details, please refer to `LDSC_SummariseOutput_Functions.R` found in the `R` directory of this package.

## Collecting the results

```r

results <- LDSCforRyten::Assimilate_H2_results(path_to_results = list.files(path = "/home/rreynolds/data/Extra_Annotations/Juan2019.ML/Output/", 
                                                                            pattern = ".results", 
                                                                            full.names = T)) %>% 
  Calculate_enrichment_SE_and_logP(one_sided = "+") %>% 
  separate(annot_name, into = c("prediction_quality", "panel"),sep = ":") %>% 
  dplyr::select(GWAS, prediction_quality, panel, everything()) %>% 
  dplyr::mutate(GWAS = str_replace(GWAS, "ILAE.Epilepsy2018.All", "EPI All"),
                GWAS = str_replace(GWAS, "ILAE.Epilepsy2018.Focal", "EPI Focal"),
                GWAS = str_replace(GWAS, "ILAE.Epilepsy2018.Generalised", "EPI Generalised"),
                GWAS = str_replace(GWAS, "PD2018.ex23andMe", "PD2018.ex23&Me"))

# # Saved once
# write_csv(x = results %>%
#   dplyr::select(-Enrichment.Lower.SE, -Enrichment.Upper.SE, -Log_P, -Coefficient.Lower.SE, -Coefficient.Upper.SE),
#   path = "/home/rreynolds/misc_projects/JuanBotia_G2PML/results/20190802_LDSCsummary.csv")

# Only first three results shown
results[1:3,] %>% 
  dplyr::filter(prediction_quality == "prediction.100") %>% 
  dplyr::select(-Enrichment_p, -Enrichment.Lower.SE, -Enrichment.Upper.SE, -Log_P, -Coefficient.Lower.SE, -Coefficient.Upper.SE)

Plotting the results

Multipe test corrections were applied as follows:
- p < 0.05/(1 x 10) (as represented by black dashed line)
- 1 reflects the number of GWAS i.e. 1 (keeping the multiple correction relatively lax at this point and not correcting for GWAS, although this would typically be done)
- 10 reflects the number of panels tested
Presented in the figures is:
- Coefficient p-value: tests whether an annotation category positively contributes to trait heritability, conditional upon the LDSC baseline model (which in this case is the 97 annotation model that accounts for genetic architecture, LD structures and sequence conservation)
- Only plotted a few examples.

ggplot(data = results[1:3,] %>% 
         dplyr::filter(prediction_quality == "prediction.100"),
       aes(x = MarkerGenes::reorder_within(x = panel,
                                           by = Z_score_logP,
                                           within = GWAS,
                                           fun = max,
                                           desc = TRUE),
           y = Z_score_logP)) + 
  geom_point(aes(), shape = 19, size = 1.5) + 
  geom_hline(aes(yintercept = -log10(0.05/(1*10))), linetype = "77", size = 0.6) + 
  MarkerGenes::scale_x_reordered() +
  coord_flip() +
  facet_wrap(~ GWAS, scales = "free_y", ncol = 2) +
  labs(x = "", y = expression("Coefficient P-value (-log"[1][0]~"p)"), title = "") +
  theme_bw() + 
  theme(axis.title = element_text(size = 8, face = "bold"),
        axis.text.x = element_text(size = 6, angle = 45, hjust = 1, vjust = 1),
        strip.text = element_text(size = 6),
        legend.position = "none",
        legend.key.size = unit(1,"line")) +
  scale_colour_manual(guide = FALSE,
                      name = element_blank())