In nwvankuren/genomics-plotting: Utilities for plotting statistics from genome-wide analyses

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(gwplotting)

Introduction {#introduction}

The gwplotting package helps simplify plotting of common statistics from genome-wide analyses, for example genome-wide association studies and windowed population genomics analyses. These functions depend primarily on simple manipulations using the tidyverse package. The functions included in gwplotting fall into three primary categories:

File loading functions
- load_gemma_gwas
- load_plink_gwas
- load_vcftools_stats
- load_popgenWindows
- load_abbababa
- load_plink_ld
Reordering or binning functions
- reorder_scaffolds
- reorder_by_scaf_len
- get_cumulative_positions
- calculate_ld_decay
Plotting functions
- plot_genomewide_data
- plot_region_data
- add_annotations

Most of the functions have self-explanatory names. Below I will outline a few case studies to highlight how this package can be used.

Motivation

Our lab has sequenced, assembled, and annotated a number of non-model species' genomes. We then typically generate population re-sequencing data that we use to perform population genetic and population genomics analyses. The variety and frequency with which we analyze different genomes and datasets prompted me to make a simple package that will perform common plotting tasks. This package takes as input a few common file types to output a number of useful plots that can be refined using additional ggplot2 functions, if necessary.

Important file types and objects

Many gwplotting functions require a small set of key files and objects that can be easily generated with widely-used tools.

1. Key files

A. Scaffold lengths {#scaf-lens}

Reordering and plotting functions require a 2-column tab-delimited file containing scaffold names and their lengths, e.g:

scaffold1      1015
scaffold2     12543
...
scaffold1234 128910

You can generate this file using samtools faidx and cut the first two columns from the .fai file. I typically call this <genome_name>.info.

B. Mapping files {#mapping-files}

It is usually useful to try and assign your draft genome scaffolds to a chromosome-level genome assembly from a closely-related species. This mapping information can then be used with certain functions to reorder your results according to the chromosome-level assembly. There are a variety of tools that will do this, but I have mainly used two methods:

RaGOO, which is based on DNA sequence similarity and minimap2 alignments. This works well up to ~5 - 10% divergence between draft and reference genomes. I then convert RaGOO output to a common format using the custom perl script ConvertRagooToMapping.pl
OrderScaffoldsByBlatProteins.pl, which is based on best protein alignments from the reference genome to the draft genome.

However, gwplotting functions that utilize mapping information (e.g. reorder_scaffolds()) really only require the mapping file to contain four columns with these names:

scaf: the draft scaffold name
chr: the reference chromosome or scaffold that the draft scaffold is assigned to
strand: the orientation of the draft scaffold relative to the chromosome (+ or -)
median_pos: originally the median position of the match on the reference genome, this can be any number that gives the relative order of the draft scaffolds within the chromosomes.

Examples:

| scaf | scafLen | scafMin | scafMax | chr | chrMin | chrMax | strand | num_proteins | median_pos| |:------------|------:|-----:|------:|:------------|-----:|------:|:--|:---|--:| |scaffold903 | 10227 | 2723 | 10035 | Hmel200007o | 3662 | 13152 | + | NA | 1 | |scaffold728 | 21685 | 148 | 20708 | Hmel200008o | 5874 | 31779 | + | NA | 1 | |scaffold2427 | 7208 | 4206 | 6951 | Hmel200023o | 5365 | 6970 | - | NA | 1 |

...

| scaf | scafLen | scafMin | scafMax | scafMid | chr | chrMin | chrMax | chrMedian | strand | num_proteins | |:-----|--------:|--------:|--------:|--------:|:---|--------:|-------:|------------------:|:-------|--------------:| | scaffold706 | 23947 | 961 |7068 |4158.5 |Hmel200218o |45453| 54246| 49334.5 | +| 2 | | scaffold426 | 152346 | 22717 | 148311| 94437 |Hmel201001o| 3557003| 3685557| 3628260.5 |2| | scaffold545| 74324| 6079| 42055| 27244.75 |Hmel201001o| 4823317| 4871986| 4845220.75 |-| 2|

C. GFF and BED format files {#gff-bed-files}

GFF v3 and BED files containing features of interest can be supplied to some functions to add annotations to plots along the x-axis.

Key object

The file loading functions all produce a common, 4-column tibble containing (named) columns with scaffolds (scaf), scaffold positions (ps), statistics (stat), and chromosome assignments (chr). For example, reading in the example GEMMA GWA file would yield:

a <- system.file( "extdata", "test.gemma_gwas.txt.gz", package = "gwplotting" )

gwa_obj <- load_gemma_gwas( a, pval = 'p_wald' )

gwa_obj

So, any analysis whose results can be coerced into this type of object can be used as input to reordering or plotting functions. If you do not have chromosome assignments, simply add a dummy column of 1s. Note that this MUST BE A TIBBLE, not a data.frame, for use with these downstream functions.

Detailed examples

[@VanKuren2019] used GWA to identify loci controlling a phenotype switch in the polymorphic butterfly Papilio clytia. These butterflies develop one of two distinct wing color patterns that mimic distantly related toxic butterflies. Alternate alleles at a single locus, containing the gene cortex, determine which color pattern each P. clytia individual develops. We sequenced the genomes of 27 individuals, about half of each color pattern, then performed GWA using GEMMA. We will analyze 100,000 of those sites.

1. Plotting genome-wide results

Load GWA results using built-in loading functions, or load and manipulate into the final form. Just ensure that it's a tibble and that it has the columns scaf, ps, stat, and chr. For GWA, stat will be the raw p-value. We already loaded our GWA results above into gwa_obj.

We can plot the unordered results.

scafs <- system.file("extdata", "test.chromSizes.txt.gz", package = "gwplotting" )

unord_plot <- plot_genomewide_data( input = gwa_obj, type = 'gwas', scaffold_lengths = scafs, plotting_column = 'stat' )

unord_plot

This is fine, but usually uninformative for finding interesting regions because you have no idea if multiple peaks on different small scaffolds in reality correspond to a single peak. We can use a chromosome-level assembly for a closely-related species to assign our scaffolds to chromosomes. We can use the output of these assignment pipelines to order these draft scaffolds during plotting.

mapping <- system.file( "extdata", "test.reordering.txt.gz", package = "gwplotting" )

ordered_gwa <- reorder_scaffolds( input = gwa_obj, assignments = mapping, species = 'pxut' )

ord_plot <- plot_genomewide_data( input = ordered_gwa, type = 'gwas', scaffold_lengths = scafs, plotting_column = 'stat' )

ord_plot

Certainly the plots are not exactly what you want, but you can easily tweak them in Illustrator or Inkscape. You can save these plots using ggsave.

You will not always have chromosome mapping information, but you can still clean up your plot a little simply by re-ordering by scaffold length (longest to shortest) and removing short scaffolds using the reorder_by_scaf_len() function and plotting the same way.

2. Plotting region data

There appears to be a nice peak on chromosome 13 that we would want to investigate further. We can first plot chromosome 13 as a whole to get a sense for where the peak is. We can then zoom in further and add interesting features such as scaffolds and genes.

chr13_plot <- plot_region_data( input = ordered_gwa, chromosome = 13, type = 'gwas', color = 'grey', scaffold_lengths = scafs )

# plot chr13_plot. Peak is ~5.1 - 6.4 Mb
chr13_zoom <- plot_region_data( input = ordered_gwa, chromosome = 13, from = 5100000, to = 6400000, type = 'gwas', color = 'grey', scaffold_lengths = scafs )

cowplot::plot_grid( plotlist = list(chr13_plot, chr13_zoom), labels = c('A)','B)') )

We want to know where gene models and scaffolds are located in the zoomed-in plot. This is not perfect, and is really meant to be a starting point for finalizing in Inkscape or Illustrator.

gff_file <- system.file( "extdata", "test.gff.gz", package = "gwplotting" )

add_scafs <- add_annotations( )

nwvankuren/genomics-plotting documentation built on April 14, 2021, 1:18 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

nwvankuren/genomics-plotting
Utilities for plotting statistics from genome-wide analyses

In nwvankuren/genomics-plotting: Utilities for plotting statistics from genome-wide analyses

Introduction {#introduction}

Motivation

Important file types and objects

1. Key files

A. Scaffold lengths {#scaf-lens}

B. Mapping files {#mapping-files}

C. GFF and BED format files {#gff-bed-files}

Key object

Detailed examples

1. Plotting genome-wide results

2. Plotting region data

R Package Documentation

Browse R Packages

We want your feedback!

nwvankuren/genomics-plotting Utilities for plotting statistics from genome-wide analyses

In nwvankuren/genomics-plotting: Utilities for plotting statistics from genome-wide analyses

Introduction {#introduction}

Motivation

Important file types and objects

1. Key files

A. Scaffold lengths {#scaf-lens}

B. Mapping files {#mapping-files}

C. GFF and BED format files {#gff-bed-files}

Key object

Detailed examples

1. Plotting genome-wide results

2. Plotting region data

R Package Documentation

Browse R Packages

We want your feedback!

nwvankuren/genomics-plotting
Utilities for plotting statistics from genome-wide analyses