knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

2021-03

1. Introduction

1.1 Pan-genome and PAV analysis

  Pan-genome is the totality of genes found across members of a given group, revealing the diversity and functional potential within that group. PAV(Presence-absence variation) analysis is an essential step in pan-analysis. The core genome contains genes shared by all and the distributed genome consist of genes not shared by all. The distributed genome can be further divided into genes shared in most members(soft-core genes), genes shared between some members(distributed/accessory genes), and genes present in only one member(unique/private genes).

  “Map-to-pan” is a common strategy for eukaryotic pan-genome analyses. Firstly, construct pan-genome by integrating the reference genome and non-reference sequences. Then, align reads to the pan-genome and examine gene coverage. Finally, determine gene presence/absence and analysis PAV table.

1.2 The core processing workflow of vPan

  vPan is an efficient tool to explore and visualize the complex result in PAV analysis. It provides four modules to 1) display gene coverage distribution, 2) analyze and visualize PAV table, 3) simulate pan/core genome sizes, and 4) find phenotype-associated genes.

  The workflow starts with an object of COV(module 1) or PAV(module 2-4) class. It can be produced by function get_cov_obj() and get_pav_obj(). It contains coverage/PAV matrix, arguments, phenotype information, and gene information, and will be the main input for the subsequent steps.

2. Installation

2.1 Installing R/RStudio

  If you do not already have R/RStudio installed, please do as follows:

2.2 Check or install packages

packages <- c("snowfall", "data.table", "ggdendro", "ggplot2", "ggrepel", "ggsignif", "randomcoloR")
lapply(packages, function(x) {
    if(!require(x, character.only = TRUE)) {
        install.packages(x, dependencies = TRUE)
    }})
if (!requireNamespace("BiocManager", quietly = TRUE)){
  install.packages("BiocManager")
}
BiocManager::install("ComplexHeatmap")

2.3 Install metaFunc from github.

if (!requireNamespace("devtools", quietly = TRUE))
  install.packages("devtools")
library(devtools)
install_github("xiaonui/vPan", build_vignettes = TRUE)

3. Input data

2.1 COV class

  The function get_cov_obj is used to generate an object of COV class. It requires a numeric matrix or a data.frame of coverage data. The row names are gene names and the column names are sample names. Besides, the phen_info and gene_info are optional. The phen_info should be a data.frame of phenotype and any other attribute about samples. The gene_info should be a data.frame of gene information (e.g. reference/novel, chromosomes, position).

2.2 PAV class

  The function get_pav_obj is used to generate an object of PAV class. It requires a numeric matrix or a data.frame of PAV table. The 0 represents absence and 1 represents presence. The row names are gene names and the column names are sample names. The phen_info and gene_info are the same as 2.1 COV class.

  add_softcore and add_private are logical values indicating whether to consider "softcore" and "private" when determining the gene types.

2.3 Demo data

The demo data is used to demonstrate the function. It comes from SGDP (Simons Genome Diversity Project, Mallick S, 2016). We align the reads to a pan-genome of human and obtain the CDS coverage. We selected 111 samples from Asia (EastAsia, SouthAsia and CentralAsiaSiberia) and genes on autosomes. In order to show the result of functions more clearly, we removed genes with a coverage of 100% in all samples.

"EastAsia", "SouthAsia", "CentralAsiaSiberia"

library(vPan)
data("vPan_demo")
head(cov_data[, 1:10])
head(gene_info_data)
head(phen_info_data)
my_cov <- get_cov_obj(cov_data, gene_info = gene_info_data, phen_info = phen_info_data)

pav_data <- ifelse(cov_data > 0.9, 1, 0)
my_pav <- get_pav_obj(pav_data, gene_info = gene_info_data, phen_info = phen_info_data)

4. Module1: Coverage distribution

4.1 cov_heatmap()

  The heatmap giving an overview of coverage. The color mapping is generated by setting cov_colors. The row and column can be clustered. The cluster has some general settings, such as whether to apply clustering or show dendrograms, the side of the dendrograms, and the width of the dendrograms.

cov_heatmap(my_cov, 
            cov_colors = c("white", "#F8766D"),
            cluster_rows = T,
            cluster_columns = T)

  The phen_info and gene_info can be added to the figure. The anno_param_column_gene and anno_param_row_phen are the parameters list. The gene_info_color_list and phen_info_color_list are used to change the colors of annotations.

cov_heatmap(my_cov, 
            cluster_rows = T,
            cluster_columns = T,
            add_phen_info = c("Genetic_sex", "Region"), 
            phen_info_color_list = list(
              Genetic_sex = c("XX" = "#A6D854", "XY" = "#8DA0CB", "Not Assigned" = "gray70"), 
              Region = structure(c("#66C2A5", "#FFD92F", "#FC8D62"), 
                                 names = unique(phen_info_data$Region))),
            add_gene_info = c("length"),
            gene_info_color_list = list(length = c("#dbebfa", "#377EB8")))

4.2 cov_density()

  Then, you can focus on serval genes of interest.

genes <- names(head(sort(apply(cov_data, 1, median))))
cov_density(my_cov, genes)

  The gene_info can be added.

cov_density(my_cov, genes, 
            row_names_side = "right",
            add_gene_info = c("chr", "length"),
            gene_info_color_list = list(length = c("#dbebfa", "#377EB8")))

5. Module 2: PAV analysis

5.1 pav_halfviolin()

  At first, you can observe the number of genes present in samples in a half-violin plot. The left half is the density estimate and each point in the right represents a sample.

pav_halfviolin(my_pav)

  If you add phen_info, the points will be grouped according to the phenotype.

pav_halfviolin(my_pav, 
               add_phen_info = "Region")

5.2 pav_hist()

  The genes can be divided into multiple types based on how many samples are shared. pav_hist() integrate a ring chart and a histogram to showing the number of gene types. The ring_pos_x, ring_pos_y, and ring_r specify the position and radius of the ring chart. The x-axis of the histogram is the number of samples, ranging from 1 to all samples. The y-axis is the number of genes shared by x samples.

pav_hist(my_pav, 
         ring_r = .45,
         y_title = "Number of genes")

5.3 pav_stackbar()

  The composition of genes in all samples can be viewed in pav_stackbar(). The chart consists of a hierarchical cluster tree and a bar plot. The dend_width and name_width are the relative widths of dendrogram and sample names. The dashed line and number labels indicate the mean value of cumulative sums. e.g. The first line is the mean of core gene number, the second line is the mean of (core + soft-core) gene number.

pav_stackbar(my_pav,
             name_width = .17,
             dend_width = .1,
             sample_name_size = 2)

  If show_relative is TRUE, the result is relative values.

pav_stackbar(my_pav,
             name_width = .17,
             dend_width = .1,
             sample_name_size = 2,
             show_relative = T)

  If you add phen_info, the sample names will be colored.

pav_stackbar(my_pav, 
             name_width = .17,
             dend_width = .1,
             sample_name_size = 2,
             add_phen_info = "Region")

5.4 pav_heatmap()

  Heatmap is an intuitive way to display total PAV data. pav_heatmap() provides a heatmap and two summary annotations of presence genes. The anno_param_row_stat and anno_param_column_stat are the parameters list and you can hide annotation by setting list(show = FALSE).

  The column will be split into blocks according to gene types. If split_block is FALSE, the split will disappear. The name of blocks in the upper can be adjusted by block_name_size and block_name_rot.

pav_heatmap(my_pav, 
            gene_type = c("core", "softcore", "distributed"),
            block_name_size = 7)

  The rows and columns can be clustered. Please note that when the number of rows or columns is huge, it will take a long time.

pav_heatmap(my_pav,
            gene_type = c("softcore", "distributed"),
            split_block = FALSE,
            cluster_columns = TRUE, 
            cluster_rows = TRUE,
            column_dend_side = "bottom",
            row_dend_side = "right")

  If split_block is TRUE and cluster_columns is TRUE, clustering will be performed in each block.

pav_heatmap(my_pav, 
            gene_type = c("softcore", "distributed"),
            split_block = TRUE,
            cluster_columns = TRUE)

  The phen_info and gene_info can be added to the figure.

pav_heatmap(my_pav, 
            gene_type = c("softcore", "distributed"),
            add_phen_info = c("Genetic_sex", "Region"), 
            phen_info_color_list = list(
              Genetic_sex = c("XX" = "#A6D854", "XY" = "#8DA0CB", "Not Assigned" = "gray70"), 
              Region = structure(c("#66C2A5", "#FFD92F", "#FC8D62"), names = unique(phen_info_data$Region))),
            add_gene_info = c("length"),
            gene_info_color_list = list(length = c("#dbebfa", "#377EB8")))

  If you want to specify the order of rows/columns, you can set row_sorted/column_sorted. Please make sure to set cluster_columns and split_block to FALSE to get the desired result.

5.5 pav_cluster()

  If you want to just show clustering of samples without complex heatmap, you can use pav_cluster().

pav_cluster(my_pav,
            mult = .3,
            sample_name_size = 2)

  If you add phen_info, the sample names and lines will be colored.

pav_cluster(my_pav,
            mult = .3,
            sample_name_size = 2, 
            add_phen_info = "Region",
            phen_info_color_list = list(
              Region = structure( c("#F8766D", "#00BA38", "#619CFF"), names = unique(phen_info_data$Region))))

5.6 pav_pca()

  pav_pca() will perform PCA analysis of PAV data using prcomp(). The center, scale, and rank will pass to prcomp().

pav_pca(my_pav)

  If you add phen_info, the sample points will be colored.

pav_pca(my_pav, 
        add_phen_info = "Region")

Module 6: Simulation

6.1 sim_stat()

  Simulation can be used to predict the size of the genome. The function sim_stat() is used to generate simulation results. It supports three genome types: pan-genome, core genome, and private genome. The simulation result is the input data for function sim_plot to visualize. If parallel is TRUE, package snowfall will be used for parallel computing to save time.

my_sim <- sim_stat(my_pav, genome_type = c("pan", "core"))

6.2 sim_plot()

  The sim_plot offers four chart types: "box", "jitter", "ribbon" and "errorbar".

sim_plot(my_sim)
sim_plot(my_sim, chart_type = "jitter")
sim_plot(my_sim, chart_type = "ribbon")
sim_plot(my_sim, chart_type = "errorbar")

6.3 sim_multi_stat and sim_multi_plot

  These two functions are similar to sim_stat() and sim_plot(). But they are used for grouping simulation.

my_sim_2 <- sim_multi_stat(my_pav, "Region")
sim_multi_plot(my_sim_2)

7. Module4: Phenotype Association

7.1 phen_stat()

  Phenotype association can help researchers to understand the potential biological functions. For discrete values, the fisher's exact test (fisher.test()) will be used to determine whether the distribution of each gene is uniform. For continuous values, Wilcoxon tests(wilcox.test()) will be performed.

my_phen <- phen_stat(my_pav, 
                    c("Genetic_sex", "DNA_source", "Region", "Coverage_mean"),
                    p_adjust_method = "fdr")

7.2 phen_heatmap()

  The phen_heatmap() showing the main result of phenotype association analysis using a heat map. It requires the PAV object and the result from phen_stat(). The rows and columns represent genes and phenotypes. You can filp coordinates using flip. If adjust_p is TRUE, the adjusted p is used, otherwise use p-value. Genes with at least one p_value/p_adjust under p_threshold will be displayed. The color mapping of p_value/p_adjust is generated by setting cov_colors.

phen_heatmap(my_pav, 
             my_phen,
             cell_border_color = "white",
             na_col = "gray90",
             flip = T,
             adjust_p = F,
             p_threshold = 0.1,
             column_names_size = 7)

  By default, only the significant cell be mapping colored, the color of other cells is na_col. You can cancel it by setting only_show_significant.

phen_heatmap(my_pav, 
             my_phen,
             cell_border_color = "white",
             na_col = "gray90",
             flip = T,
             adjust_p = F,
             p_threshold = 0.1,
             column_names_size = 7,
             only_show_significant = FALSE)

  The gene_info can be added to the figure.

phen_heatmap(my_pav, 
             my_phen,
             cell_border_color = "white",
             na_col = "gray90",
             flip = T,
             adjust_p = T,
             p_threshold = 0.1,
             column_names_size = 7,
             add_gene_info = c("chr", "legnth"),
             anno_param_gene = list(name_rot = 0))

7.3 phen_manhattan

  If you want to study a phenotype further, you can plot a manhattan plot using phen_manhattan(). But it required chromosomes and position in the gene_info of PAV object. The p_value and p_ajusted can be chosen by adjsut_p. The most significant genes will be highlighted and labeled.

phen_manhattan(my_pav, my_phen, "Region", "chr_n", "start", highlight_top_n = 10, highlight_text_size = 2,
               x_text_size = 5)

7.4 phen_block

  If the phenotype is discrete values, the phen_block can be used to observe the presence/absence genes in every group. The number in brackets represents the sample size.

phen_block(my_pav, my_phen, "Region", 
           adjust_p = F,
           p_threshold = .1, 
           row_names_size = 6,
           cell_border_color = "black")

  By default, the p_value/p_adjusted will be showed in annotation. If you want to view other information in gene_info, you can add them using 'add_gene_info'.

phen_block(my_pav, my_phen, "Region", 
           adjust_p = F,
           p_threshold = .1, 
           row_names_size = 6,
           cell_border_color = "black", 
           add_gene_info = c("p", "chr"))

7.5 phen_bar and phen_violin

  These two functions focus on showing the relationship between a certain gene and a certain phenotype. The phen_bar() designed for discrete values and the phen_violin() designed for continuous values.

phen_bar(my_pav, "Region", sample(my_pav@gene$name[my_pav@gene$type == "distributed"], 1))
phen_violin(my_pav, "Coverage_mean", sample(my_pav@gene$name[my_pav@gene$type == "distributed"], 1))

  Putting together multiple result graphs can be easy to observe.

do.call(gridExtra::grid.arrange, 
        lapply(sample(my_pav@gene$name,10), function(x){
          phen_bar(my_pav, "Region", x)
        }))
do.call(gridExtra::grid.arrange, 
        lapply(sample(my_pav@gene$name,10), function(x){
          phen_violin(my_pav, "Coverage_mean", x)
        }))

Reference

Mallick S, Li H, Lipson M, et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature. 2016;538(7624):201-206. doi:10.1038/nature18964



xiaonui/vPan documentation built on Dec. 23, 2021, 6:17 p.m.