finemap_loci: Fine-map multiple loci

View source: R/finemap_loci.R

finemap_loci    R Documentation

Fine-map multiple loci

Description

echolocatoR will automatically fine-map each locus. Uses the topSNPs data.frame to define locus coordinates.

Usage

finemap_loci(
  loci = NULL,
  fullSS_path,
  fullSS_genome_build = NULL,
  results_dir = file.path(tempdir(), "results"),
  dataset_name = "dataset_name",
  dataset_type = "GWAS",
  topSNPs = "auto",
  force_new_subset = FALSE,
  force_new_LD = FALSE,
  force_new_finemap = FALSE,
  finemap_methods = c("ABF", "FINEMAP", "SUSIE"),
  finemap_args = NULL,
  n_causal = 5,
  credset_thresh = 0.95,
  consensus_thresh = 2,
  fillNA = 0,
  conditioned_snps = "auto",
  priors_col = NULL,
  munged = FALSE,
  colmap = echodata::construct_colmap(munged = munged),
  compute_n = "ldsc",
  LD_reference = "1KGphase3",
  LD_genome_build = "hg19",
  leadSNP_LD_block = FALSE,
  superpopulation = "EUR",
  download_method = "axel",
  bp_distance = 5e+05,
  min_POS = NA,
  max_POS = NA,
  min_MAF = NA,
  trim_gene_limits = FALSE,
  max_snps = NULL,
  min_r2 = 0,
  remove_variants = FALSE,
  remove_correlates = FALSE,
  query_by = "tabix",
  case_control = TRUE,
  qtl_suffixes = NULL,
  plot_types = c("simple"),
  show_plot = TRUE,
  zoom = "1x",
  tx_biotypes = NULL,
  nott_epigenome = FALSE,
  nott_show_placseq = FALSE,
  nott_binwidth = 200,
  nott_bigwig_dir = NULL,
  xgr_libnames = NULL,
  roadmap = FALSE,
  roadmap_query = NULL,
  remove_tmps = TRUE,
  conda_env = "echoR_mini",
  return_all = TRUE,
  use_tryCatch = TRUE,
  seed = 2022,
  nThread = 1,
  verbose = TRUE,
  top_SNPs = deprecated(),
  PP_threshold = deprecated(),
  consensus_threshold = deprecated(),
  plot.Nott_epigenome = deprecated(),
  plot.Nott_show_placseq = deprecated(),
  plot.Nott_binwidth = deprecated(),
  plot.Nott_bigwig_dir = deprecated(),
  plot.Roadmap = deprecated(),
  plot.Roadmap_query = deprecated(),
  plot.XGR_libnames = deprecated(),
  server = deprecated(),
  plot.types = deprecated(),
  plot.zoom = deprecated(),
  QTL_prefixes = deprecated(),
  vcf_folder = deprecated(),
  probe_path = deprecated(),
  file_sep = deprecated(),
  chrom_col = deprecated(),
  chrom_type = deprecated(),
  position_col = deprecated(),
  snp_col = deprecated(),
  pval_col = deprecated(),
  effect_col = deprecated(),
  stderr_col = deprecated(),
  tstat_col = deprecated(),
  locus_col = deprecated(),
  freq_col = deprecated(),
  MAF_col = deprecated(),
  A1_col = deprecated(),
  A2_col = deprecated(),
  gene_col = deprecated(),
  N_cases_col = deprecated(),
  N_controls_col = deprecated(),
  N_cases = deprecated(),
  N_controls = deprecated(),
  proportion_cases = deprecated(),
  sample_size = deprecated(),
  PAINTOR_QTL_datasets = deprecated()
)

Arguments

loci

Character vector of locus names, matching entries in the Locus column of topSNPs.

fullSS_path

Path to the full summary statistics file (GWAS or QTL) that you want to fine-map. It is usually best to provide the absolute path rather than the relative path.

fullSS_genome_build

Genome build of the full summary statistics (fullSS_path). Can be "GRCH37" or "GRCH38" or one of their synonyms. If fullSS_genome_build==NULL and munged=TRUE, the genome build (hg19 vs. hg38) is inferred from the summary statistics using get_genome_builds.

results_dir

Where to store all results. IMPORTANT!: It is usually best to provide the absolute path rather than the relative path. This is especially important for FINEMAP.

dataset_name

The name you want to assign to the dataset being fine-mapped. This will be used to name the subdirectory where your results will be stored (e.g. Data/GWAS/<dataset_name>). Don't use special characters (e.g. ".", "/").

dataset_type

The kind of dataset you're fine-mapping (e.g. GWAS, eQTL, tQTL). This will also be used when creating the subdirectory where your results will be stored (e.g. Data/<dataset_type>/Kunkle_2019).

topSNPs

A data.frame with the genomic coordinates of the lead SNP for each locus. The lead SNP will be used as the center of the window when extracting a subset from the full GWAS/QTL summary statistics file. Only one SNP per Locus should be included. At minimum, topSNPs should include the following columns:

Locus

A unique name for each locus. Often, loci are named after a relevant gene (e.g. LRRK2) or based on the name/coordinates of the lead SNP (e.g. locus_chr12_40734202)

CHR

The chromosome that the SNP is on. Can be "chr12" or "12" format.

POS

The genomic position of the SNP (in basepairs)
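
As a minimal sketch, a topSNPs table with the three required columns can be built by hand in base R. The first row reuses the coordinate-style locus name from the description above; the second row is a made-up placeholder, not a real lead SNP.

```r
# Minimal topSNPs table: one row (lead SNP) per locus.
# The second locus and its coordinates are illustrative placeholders only.
topSNPs <- data.frame(
  Locus = c("locus_chr12_40734202", "locus_chr4_15000000"),
  CHR   = c("12", "4"),          # the "chr12" style is also accepted
  POS   = c(40734202, 15000000), # position in basepairs
  stringsAsFactors = FALSE
)

# Sanity checks: required columns are present and each locus appears once.
stopifnot(all(c("Locus", "CHR", "POS") %in% colnames(topSNPs)))
stopifnot(!anyDuplicated(topSNPs$Locus))
```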

force_new_subset

By default, if a subset of the full summary stats file for a given locus is already present, then echolocatoR will just use the pre-existing file. Set force_new_subset=T to override this and extract a new subset. Subsets are saved in the following path structure: Data/<dataset_type>/<dataset_name>/<locus>/Multi-finemap/<locus>_<dataset_name>_Multi-finemap.tsv.gz
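
The path layout described above can be sketched with file.path. The helper name subset_path is hypothetical and not part of echolocatoR; it only illustrates how the pieces combine into a relative path.

```r
# Hypothetical helper (not an echolocatoR function): assemble the
# relative path of a locus subset following the documented layout.
subset_path <- function(dataset_type, dataset_name, locus) {
  file.path(
    "Data", dataset_type, dataset_name, locus, "Multi-finemap",
    paste0(locus, "_", dataset_name, "_Multi-finemap.tsv.gz")
  )
}

p <- subset_path("GWAS", "Nalls23andMe_2019", "BST1")
print(p)
```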

force_new_LD

Force new LD subset.

force_new_finemap

By default, if a fine-mapping results file for a given locus is already present, then echolocatoR will just use the pre-existing file. Set force_new_finemap=T to override this and re-run fine-mapping.

finemap_methods

Which fine-mapping methods you want to use.

finemap_args

A named nested list containing additional arguments for each fine-mapping method. e.g. finemap_args = list(FINEMAP=list(), PAINTOR=list(method=""))

n_causal

The maximum number of potential causal SNPs per locus. This parameter is used somewhat differently by different fine-mapping tools. See tool-specific functions for details.

credset_thresh

The minimum fine-mapped posterior probability for a SNP to be considered part of a Credible Set. For example, credset_thresh=.95 means that all Credible Set SNPs will be 95% Credible Set SNPs.

consensus_thresh

The minimum number of fine-mapping tools in which a SNP is in the Credible Set in order to be included in the "Consensus_SNP" column.
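
A simplified sketch of how credset_thresh and consensus_thresh interact, using made-up posterior probabilities. The column names (including Support) and the per-SNP thresholding are assumptions for illustration; individual fine-mapping tools may construct their credible sets differently.

```r
# Hypothetical per-SNP posterior probabilities from three methods.
pp <- data.frame(
  SNP     = c("rs1", "rs2", "rs3", "rs4"),
  ABF     = c(0.99, 0.10, 0.96, 0.01),
  FINEMAP = c(0.97, 0.96, 0.20, 0.02),
  SUSIE   = c(0.98, 0.30, 0.97, 0.03)
)
credset_thresh   <- 0.95
consensus_thresh <- 2
methods <- c("ABF", "FINEMAP", "SUSIE")

# A SNP is treated as a Credible Set SNP for a method when its
# posterior probability reaches credset_thresh.
in_cs <- pp[, methods] >= credset_thresh

# Count the methods supporting each SNP, then flag consensus SNPs.
pp$Support       <- rowSums(in_cs)
pp$Consensus_SNP <- pp$Support >= consensus_thresh
print(pp)
```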

fillNA

Value to fill LD matrix NAs with.
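
For illustration, filling missing LD values amounts to the following base R operation on a small made-up matrix (the SNP names and values are placeholders):

```r
# Toy LD matrix with missing entries.
snps <- c("rs1", "rs2", "rs3")
LD <- matrix(c(1.0,  NA, 0.2,
                NA, 1.0, 0.5,
               0.2, 0.5, 1.0),
             nrow = 3, byrow = TRUE,
             dimnames = list(snps, snps))

fillNA <- 0
LD[is.na(LD)] <- fillNA  # replace every NA with the fill value
```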

conditioned_snps

Which SNPs to condition on when fine-mapping (e.g. with COJO).

priors_col

[Optional] Name of a column in dat to extract SNP-wise prior probabilities from.

munged

Whether the full summary statistics at fullSS_path have already been standardised/filtered with format_sumstats. If munged=FALSE you'll need to provide the necessary column names to the colmap argument.

colmap

Column name mappings in fullSS_path. Must be a named list. Can use construct_colmap to assist with this. This function can be used in two different ways:

  • munged=FALSE : When munged=FALSE, you will need to provide the necessary column names to the colmap argument (default).

  • munged=TRUE : Alternatively, instead of filling out each argument in construct_colmap, you can simply set munged=TRUE if fullSS_path has already been munged with format_sumstats.

compute_n

How to compute per-SNP sample size (new column "N").
If the column "N" is already present in dat, this column will be used to extract per-SNP sample sizes and the argument compute_n will be ignored.
If the column "N" is not present in dat, one of the following options can be supplied to compute_n:

  • 0: N will not be computed.

  • >0: If any number >0 is provided, that value will be set as N for every row. **Note**: Computing N this way is incorrect and should be avoided if at all possible.

  • "sum": N will be computed as: cases (N_CAS) + controls (N_CON), so long as both columns are present.

  • "ldsc": N will be computed as effective sample size: Neff = (N_CAS+N_CON) * (N_CAS/(N_CAS+N_CON)) / mean((N_CAS/(N_CAS+N_CON))[(N_CAS+N_CON)==max(N_CAS+N_CON)]).

  • "giant": N will be computed as effective sample size: Neff = 2 / (1/N_CAS + 1/N_CON).

  • "metal": N will be computed as effective sample size: Neff = 4 / (1/N_CAS + 1/N_CON).
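
The formulas above can be worked through in base R on made-up case and control counts. This is a sketch, not the package's implementation; in particular, the "ldsc" line is read here as averaging the case proportion over the rows with the largest total sample size, which is an interpretation of the expression.

```r
# Hypothetical per-SNP case/control counts.
N_CAS <- c(1000, 2000)
N_CON <- c(3000, 2000)
N_tot <- N_CAS + N_CON

n_sum   <- N_tot                          # "sum"
n_giant <- 2 / (1 / N_CAS + 1 / N_CON)    # "giant"
n_metal <- 4 / (1 / N_CAS + 1 / N_CON)    # "metal"

# "ldsc": scale by the mean case proportion among SNPs that have
# the maximum total sample size (assumed reading of the formula).
prop   <- N_CAS / N_tot
n_ldsc <- N_tot * prop / mean(prop[N_tot == max(N_tot)])

print(rbind(n_sum, n_giant, n_metal, n_ldsc))
```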

LD_reference

LD reference to use:

  • "1KGphase1" : 1000 Genomes Project Phase 1 (genome build: hg19).

  • "1KGphase3" : 1000 Genomes Project Phase 3 (genome build: hg19).

  • "UKB" : Pre-computed LD from a British European-descent subset of UK Biobank. Genome build : hg19

  • "<vcf_path>" : User-supplied path to a custom VCF file to compute LD matrix from.
    Accepted formats: .vcf / .vcf.gz / .vcf.bgz
    Genome build : defined by user with target_genome.

  • "<matrix_path>" : User-supplied path to a pre-computed LD matrix.
    Accepted formats: .rds / .rda / .csv / .tsv / .txt
    Genome build : defined by user with target_genome.

LD_genome_build

Genome build of the LD panel. This is automatically assigned to the correct genome build for each LD panel except when the user supplies custom vcf/LD files.

leadSNP_LD_block

Only return SNPs within the same LD block as the lead SNP (the SNP with the smallest p-value).

superpopulation

Superpopulation to subset LD panel by (used only if LD_reference is "1KGphase1" or "1KGphase3"). See popDat_1KGphase1 and popDat_1KGphase3 for full tables of their respective samples.

download_method

  • "axel" : Multi-threaded

  • "wget" : Single-threaded

  • "download.file" : Single-threaded

  • "internal" : Single-threaded (passed to download.file)

  • "wininet" : Single-threaded (passed to download.file)

  • "libcurl" : Single-threaded (passed to download.file)

  • "curl" : Single-threaded (passed to download.file)

bp_distance

Distance around the lead SNP to include.
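
As a rough sketch, a window of bp_distance around the lead SNP can be expressed as below. The data, the symmetric window, and the inclusive bounds are assumptions for illustration, not a description of the package internals.

```r
# Hypothetical lead SNP position and summary statistics positions.
lead_pos    <- 40734202
bp_distance <- 5e5
dat <- data.frame(
  SNP = paste0("rs", 1:5),
  POS = lead_pos + c(-600000, -500000, 0, 250000, 700000),
  stringsAsFactors = FALSE
)

# Keep SNPs within bp_distance on either side of the lead SNP.
window  <- c(lead_pos - bp_distance, lead_pos + bp_distance)
dat_sub <- dat[dat$POS >= window[1] & dat$POS <= window[2], ]
print(dat_sub)
```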

min_POS

Minimum genomic position to include.

max_POS

Maximum genomic position to include.

min_MAF

Minimum Minor Allele Frequency (MAF) of SNPs to include.

trim_gene_limits

If a gene name is supplied to this argument (e.g. trim_gene_limits="BST"), only SNPs within the gene body will be included.

max_snps

Maximum number of SNPs to include.

min_r2

Correlation threshold for remove_correlates.

remove_variants

A list of SNP RSIDs to remove.

remove_correlates

A list of SNPs. If provided, all SNPs that correlate with these SNPs (at r2>=min_r2) will be removed from both the dat and LD list items.
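
A minimal sketch of this filtering on a toy LD matrix of r values. The data and the min_r2 value are made up; note that in this sketch the listed SNP is dropped as well, because its r2 with itself is 1.

```r
# Toy LD matrix (correlation r) and matching summary statistics.
snps <- c("rs1", "rs2", "rs3", "rs4")
LD <- matrix(c(1.0, 0.9, 0.1, 0.3,
               0.9, 1.0, 0.2, 0.1,
               0.1, 0.2, 1.0, 0.8,
               0.3, 0.1, 0.8, 1.0),
             nrow = 4, byrow = TRUE, dimnames = list(snps, snps))
dat <- data.frame(SNP = snps, stringsAsFactors = FALSE)

remove_correlates <- c("rs1")
min_r2 <- 0.5

# Square r to get r2, then find SNPs correlated with any listed SNP.
r2      <- LD[, remove_correlates, drop = FALSE]^2
to_drop <- rownames(r2)[apply(r2 >= min_r2, 1, any)]
keep    <- setdiff(snps, to_drop)

# Remove the same SNPs from both the table and the LD matrix.
dat <- dat[dat$SNP %in% keep, , drop = FALSE]
LD  <- LD[keep, keep, drop = FALSE]
```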

query_by

Choose which method you want to use to extract locus subsets from the full summary stats file. Methods include:

"tabix"

Convert the full summary stats file into an indexed tabix file. Makes querying lightning fast after the initial conversion is done. (default)

"coordinates"

Extract locus subsets using min/max genomic coordinates with awk.

case_control

Whether the summary statistics come from a case-control study (e.g. a GWAS of having Alzheimer's Disease or not) (TRUE) or a quantitative study (e.g. a GWAS of height, or an eQTL) (FALSE).

qtl_suffixes

If columns with QTL data are included in dat, you can indicate which columns those are with one or more string suffixes (e.g. qtl_suffixes=c(".eQTL1",".eQTL2") to use the columns "P.QTL1", "Effect.QTL1", "P.QTL2", "Effect.QTL2").

plot_types

Which kinds of plots to include. Options:

  • "simple" : Just plot the following tracks: GWAS, fine-mapping, gene models.

  • "fancy" : Additionally plot XGR annotation tracks (XGR, Roadmap, Nott2019).

  • "LD" : LD heatmap showing the 10 SNPs surrounding the lead SNP.

show_plot

Print plot to screen.

zoom

Zoom into the center of the locus when plotting (without editing the fine-mapping results file). You can provide either:

  • The size of your plot window in terms of basepairs (e.g. zoom=50000 for a 50kb window).

  • How much you want to zoom in (e.g. zoom="1x" for the full locus, zoom="2x" for 2x zoom into the center of the locus, etc.).

You can pass a list of window sizes (e.g. c(50000,100000,500000)) to automatically generate multiple views of each locus. This can even be a mix of different style inputs: e.g. c("1x","4.5x",25000).
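
One way to picture the two input styles is a small converter from zoom values to window sizes in basepairs. The helper zoom_to_window is hypothetical, and it assumes that "2x" means a window half the width of the full locus.

```r
# Hypothetical helper (not an echolocatoR function): convert zoom
# specifications into plot window sizes in basepairs.
zoom_to_window <- function(zoom, locus_width) {
  zoom <- as.character(zoom)        # mixed input arrives as character
  is_factor <- grepl("x$", zoom)    # "2x"-style zoom factors
  out <- numeric(length(zoom))
  out[is_factor]  <- locus_width / as.numeric(sub("x$", "", zoom[is_factor]))
  out[!is_factor] <- as.numeric(zoom[!is_factor])  # already in basepairs
  out
}

# Mixed-style input for a locus spanning 1 Mb.
windows <- zoom_to_window(c("1x", "4.5x", 25000), locus_width = 1e6)
print(windows)
```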

tx_biotypes

Transcript biotypes to include in the gene model track. By default (NULL), all transcript biotypes will be included. See get_tx_biotypes for a full list of all available biotypes.

nott_epigenome

Include tracks showing brain cell-type-specific epigenomic data from Nott et al. (2019).

nott_show_placseq

Include track generated by NOTT2019_plac_seq_plot.

nott_binwidth

When including Nott et al. (2019) epigenomic data in the track plots, adjust the bin width of the histograms.

nott_bigwig_dir

Instead of pulling Nott et al. (2019) epigenomic data from the UCSC Genome Browser, use a set of local bigwig files.

xgr_libnames

Passed to XGR_plot. Which XGR annotations to check overlap with. Passed to the RData.customised argument in xRDataLoader; see the xRDataLoader documentation for the full list of libraries. Examples:

  • "ENCODE_TFBS_ClusteredV3_CellTypes"

  • "ENCODE_DNaseI_ClusteredV3_CellTypes"

  • "Broad_Histone"

roadmap

Find and plot annotations from Roadmap.

roadmap_query

Only plot annotations from Roadmap whose metadata contains a string or any items from a list of strings (e.g. "brain" or c("brain","liver","monocytes")).

remove_tmps

Whether to remove any temporary files (e.g. FINEMAP output files) after the pipeline is done running.

conda_env

Conda environment to use.

return_all

Return a nested list of the pipeline's various outputs, including plots, tables, and file paths (default: TRUE). If FALSE, instead only returns a single merged data.table containing the results from all loci.

use_tryCatch

If an error is encountered in one locus, the pipeline will continue to try running the rest of the loci (default: use_tryCatch=TRUE). This avoids stopping all analyses due to errors that only affect some loci, but currently prevents debugging via traceback.
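
The general pattern behind this behaviour can be sketched in base R as follows. The function run_locus is a hypothetical stand-in for per-locus fine-mapping, and the locus "BAD_LOCUS" is invented to simulate a failure; this is not the package's own loop.

```r
# Hypothetical stand-in for the per-locus fine-mapping step.
run_locus <- function(locus) {
  if (locus == "BAD_LOCUS") stop("simulated failure")
  paste("results for", locus)
}

loci <- c("BST1", "BAD_LOCUS", "MEX3C")

# Wrap each locus in tryCatch so that one failure does not stop the rest.
res <- lapply(stats::setNames(loci, loci), function(locus) {
  tryCatch(
    run_locus(locus),
    error = function(e) {
      message("Locus ", locus, " failed: ", conditionMessage(e))
      NULL  # failed loci yield NULL instead of aborting the loop
    }
  )
})
```

Because the error is caught inside the handler, the original call stack is no longer available afterwards, which is why traceback-based debugging is lost in this mode.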

seed

Set the seed for all functions where this is possible.

nThread

Number of threads to parallelise saving across.

verbose

Print messages.

top_SNPs

[deprecated]

PP_threshold

[deprecated]

consensus_threshold

[deprecated]

plot.Nott_epigenome

[deprecated]

plot.Nott_show_placseq

[deprecated]

plot.Nott_binwidth

[deprecated]

plot.Nott_bigwig_dir

[deprecated]

plot.Roadmap

[deprecated]

plot.Roadmap_query

[deprecated]

plot.XGR_libnames

[deprecated]

server

[deprecated]

plot.types

[deprecated]

plot.zoom

[deprecated]

QTL_prefixes

[deprecated]

vcf_folder

[deprecated]

probe_path

[deprecated]

file_sep

[deprecated]

chrom_col

[deprecated]

chrom_type

[deprecated]

position_col

[deprecated]

snp_col

[deprecated]

pval_col

[deprecated]

effect_col

[deprecated]

stderr_col

[deprecated]

tstat_col

[deprecated]

locus_col

[deprecated]

freq_col

[deprecated]

MAF_col

[deprecated]

A1_col

[deprecated]

A2_col

[deprecated]

gene_col

[deprecated]

N_cases_col

[deprecated]

N_controls_col

[deprecated]

N_cases

[deprecated]

N_controls

[deprecated]

proportion_cases

[deprecated]

sample_size

[deprecated]

PAINTOR_QTL_datasets

[deprecated]

Value

By default, returns a nested list of results grouped by locus name (e.g. BST1, MEX3C). The results for each locus contain the following elements:

  • finemap_dat : Fine-mapping results from all selected methods merged with the original summary statistics (i.e. Multi-finemap results).

  • locus_plot : A nested list containing one or more zoomed views of locus plots.

  • LD_matrix : The post-processed LD matrix used for fine-mapping.

  • LD_plot : An LD plot (if used).

  • locus_dir : Locus directory results are saved in.

  • arguments : A record of the arguments supplied to finemap_loci.

In addition, the following object summarizes the results across all loci:

  • merged_dat : A merged data.table with all fine-mapping results from all loci.
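
To illustrate how such a nested list might be navigated, the sketch below builds a small mock of the structure by hand. The contents are placeholders, and the assumption that merged_dat sits at the top level next to the per-locus entries should be checked against an actual return value.

```r
# Mock of the returned structure; every value here is a placeholder.
res <- list(
  BST1  = list(finemap_dat = data.frame(SNP = "rs1", Locus = "BST1"),
               locus_dir   = "path/to/BST1"),
  MEX3C = list(finemap_dat = data.frame(SNP = "rs2", Locus = "MEX3C"),
               locus_dir   = "path/to/MEX3C"),
  merged_dat = data.frame(SNP = c("rs1", "rs2"),
                          Locus = c("BST1", "MEX3C"))
)

# Per-locus entries are everything except the merged summary.
loci <- setdiff(names(res), "merged_dat")

# Pull one element out of every locus, then stack the tables.
per_locus <- lapply(res[loci], function(x) x$finemap_dat)
rebuilt   <- do.call(rbind, per_locus)
```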

See Also

Other MAIN: finemap_locus()

Examples

topSNPs <- echodata::topSNPs_Nalls2019
fullSS_path <- echodata::example_fullSS(dataset = "Nalls2019")

res <- echolocatoR::finemap_loci(
  fullSS_path = fullSS_path,
  topSNPs = topSNPs,
  loci = c("BST1","MEX3C"),
  finemap_methods = c("ABF","FINEMAP","SUSIE"),
  dataset_name = "Nalls23andMe_2019",
  fullSS_genome_build = "hg19",
  bp_distance = 1000,
  munged = TRUE)

RajLabMSSM/echolocatoR documentation built on Jan. 29, 2023, 6:04 a.m.