knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
2021-03
Pan-genome is the totality of genes found across members of a given group, revealing the diversity and functional potential within that group. PAV(Presence-absence variation) analysis is an essential step in pan-analysis. The core genome contains genes shared by all and the distributed genome consist of genes not shared by all. The distributed genome can be further divided into genes shared in most members(soft-core genes), genes shared between some members(distributed/accessory genes), and genes present in only one member(unique/private genes).
“Map-to-pan” is a common strategy for eukaryotic pan-genome analyses. Firstly, construct pan-genome by integrating the reference genome and non-reference sequences. Then, align reads to the pan-genome and examine gene coverage. Finally, determine gene presence/absence and analysis PAV table.
vPan is an efficient tool to explore and visualize the complex result in PAV analysis. It provides four modules to 1) display gene coverage distribution, 2) analyze and visualize PAV table, 3) simulate pan/core genome sizes, and 4) find phenotype-associated genes.
The workflow starts with an object of COV(module 1) or PAV(module 2-4) class. It can be produced by function get_cov_obj()
and get_pav_obj()
. It contains coverage/PAV matrix, arguments, phenotype information, and gene information, and will be the main input for the subsequent steps.
Module 1 focuses on showing the coverage and the distribution of coverage. Functions: cov_heatmap()
and cov_density()
.
Module 2 focuses on PAV analysis, including overview, classifying genes, clustring, PCA analysis. Functions: pav_heatmap()
, pav_hist()
, pav_stackbar()
, pav_cluster()
, pav_halfviolin()
and pav_pca()
.
Module 3 focuses on simulating pan/core/private genome sizes and drawing growth curve. Functions : sim_stat()
, sim_plot()
, sim_multi_stat()
and sim_multi_plot
.
Module 4 focuses on phenotype association and visualization. Functions: phen_heatmap()
, phen_manhattan()
, phen_block()
, phen_bar()
and phen_violin()
.
If you do not already have R/RStudio installed, please do as follows:
packages <- c("snowfall", "data.table", "ggdendro", "ggplot2", "ggrepel", "ggsignif", "randomcoloR") lapply(packages, function(x) { if(!require(x, character.only = TRUE)) { install.packages(x, dependencies = TRUE) }}) if (!requireNamespace("BiocManager", quietly = TRUE)){ install.packages("BiocManager") } BiocManager::install("ComplexHeatmap")
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools") library(devtools) install_github("xiaonui/vPan", build_vignettes = TRUE)
COV
class The function get_cov_obj
is used to generate an object of COV
class. It requires a numeric matrix
or a data.frame
of coverage data. The row names are gene names and the column names are sample names. Besides, the phen_info
and gene_info
are optional. The phen_info
should be a data.frame
of phenotype and any other attribute about samples. The gene_info
should be a data.frame
of gene information (e.g. reference/novel, chromosomes, position).
PAV
class The function get_pav_obj
is used to generate an object of PAV
class. It requires a numeric matrix
or a data.frame
of PAV table. The 0
represents absence and 1
represents presence. The row names are gene names and the column names are sample names. The phen_info
and gene_info
are the same as 2.1 COV
class.
add_softcore
and add_private
are logical values indicating whether to consider "softcore" and "private" when determining the gene types.
If add_softcore
is TRUE
, the genes with loss rates not significantly larger than softcore_loss_rate
will be considered as "softcore gene". Binomial tests (with a null hypothesis of loss rate < softcore_loss_rate
) are carried out for each gene. A p-value below softcore_p_value
means that this gene is lost in a significant proportion and is a "distributed gene"(loss rate > softcore_loss_rate
).
If add_private
is TRUE
, the genes present in only one sample will be considered as "private gene".
The demo data is used to demonstrate the function. It comes from SGDP (Simons Genome Diversity Project, Mallick S, 2016). We align the reads to a pan-genome of human and obtain the CDS coverage. We selected 111 samples from Asia (EastAsia, SouthAsia and CentralAsiaSiberia) and genes on autosomes. In order to show the result of functions more clearly, we removed genes with a coverage of 100% in all samples.
"EastAsia", "SouthAsia", "CentralAsiaSiberia"
library(vPan) data("vPan_demo") head(cov_data[, 1:10]) head(gene_info_data) head(phen_info_data)
my_cov <- get_cov_obj(cov_data, gene_info = gene_info_data, phen_info = phen_info_data) pav_data <- ifelse(cov_data > 0.9, 1, 0) my_pav <- get_pav_obj(pav_data, gene_info = gene_info_data, phen_info = phen_info_data)
cov_heatmap()
The heatmap giving an overview of coverage. The color mapping is generated by setting cov_colors
. The row and column can be clustered. The cluster has some general settings, such as whether to apply clustering or show dendrograms, the side of the dendrograms, and the width of the dendrograms.
cov_heatmap(my_cov, cov_colors = c("white", "#F8766D"), cluster_rows = T, cluster_columns = T)
The phen_info
and gene_info
can be added to the figure. The anno_param_column_gene
and anno_param_row_phen
are the parameters list. The gene_info_color_list
and phen_info_color_list
are used to change the colors of annotations.
cov_heatmap(my_cov, cluster_rows = T, cluster_columns = T, add_phen_info = c("Genetic_sex", "Region"), phen_info_color_list = list( Genetic_sex = c("XX" = "#A6D854", "XY" = "#8DA0CB", "Not Assigned" = "gray70"), Region = structure(c("#66C2A5", "#FFD92F", "#FC8D62"), names = unique(phen_info_data$Region))), add_gene_info = c("length"), gene_info_color_list = list(length = c("#dbebfa", "#377EB8")))
cov_density()
Then, you can focus on serval genes of interest.
genes <- names(head(sort(apply(cov_data, 1, median)))) cov_density(my_cov, genes)
The gene_info
can be added.
cov_density(my_cov, genes, row_names_side = "right", add_gene_info = c("chr", "length"), gene_info_color_list = list(length = c("#dbebfa", "#377EB8")))
pav_halfviolin()
At first, you can observe the number of genes present in samples in a half-violin plot. The left half is the density estimate and each point in the right represents a sample.
pav_halfviolin(my_pav)
If you add phen_info
, the points will be grouped according to the phenotype.
pav_halfviolin(my_pav, add_phen_info = "Region")
pav_hist()
The genes can be divided into multiple types based on how many samples are shared. pav_hist()
integrate a ring chart and a histogram to showing the number of gene types. The ring_pos_x
, ring_pos_y
, and ring_r
specify the position and radius of the ring chart. The x-axis of the histogram is the number of samples, ranging from 1 to all samples. The y-axis is the number of genes shared by x
samples.
pav_hist(my_pav, ring_r = .45, y_title = "Number of genes")
pav_stackbar()
The composition of genes in all samples can be viewed in pav_stackbar()
. The chart consists of a hierarchical cluster tree and a bar plot.
The dend_width
and name_width
are the relative widths of dendrogram and sample names.
The dashed line and number labels indicate the mean value of cumulative sums. e.g. The first line is the mean of core gene number, the second line is the mean of (core + soft-core) gene number.
pav_stackbar(my_pav, name_width = .17, dend_width = .1, sample_name_size = 2)
If show_relative
is TRUE, the result is relative values.
pav_stackbar(my_pav, name_width = .17, dend_width = .1, sample_name_size = 2, show_relative = T)
If you add phen_info
, the sample names will be colored.
pav_stackbar(my_pav, name_width = .17, dend_width = .1, sample_name_size = 2, add_phen_info = "Region")
pav_heatmap()
Heatmap is an intuitive way to display total PAV data. pav_heatmap()
provides a heatmap and two summary annotations of presence genes. The anno_param_row_stat
and anno_param_column_stat
are the parameters list and you can hide annotation by setting list(show = FALSE)
.
The column will be split into blocks according to gene types. If split_block
is FALSE, the split will disappear. The name of blocks in the upper can be adjusted by block_name_size
and block_name_rot
.
pav_heatmap(my_pav, gene_type = c("core", "softcore", "distributed"), block_name_size = 7)
The rows and columns can be clustered. Please note that when the number of rows or columns is huge, it will take a long time.
pav_heatmap(my_pav, gene_type = c("softcore", "distributed"), split_block = FALSE, cluster_columns = TRUE, cluster_rows = TRUE, column_dend_side = "bottom", row_dend_side = "right")
If split_block
is TRUE and cluster_columns
is TRUE, clustering will be performed in each block.
pav_heatmap(my_pav, gene_type = c("softcore", "distributed"), split_block = TRUE, cluster_columns = TRUE)
The phen_info
and gene_info
can be added to the figure.
pav_heatmap(my_pav, gene_type = c("softcore", "distributed"), add_phen_info = c("Genetic_sex", "Region"), phen_info_color_list = list( Genetic_sex = c("XX" = "#A6D854", "XY" = "#8DA0CB", "Not Assigned" = "gray70"), Region = structure(c("#66C2A5", "#FFD92F", "#FC8D62"), names = unique(phen_info_data$Region))), add_gene_info = c("length"), gene_info_color_list = list(length = c("#dbebfa", "#377EB8")))
If you want to specify the order of rows/columns, you can set row_sorted
/column_sorted
. Please make sure to set cluster_columns
and split_block
to FALSE
to get the desired result.
pav_cluster()
If you want to just show clustering of samples without complex heatmap, you can use pav_cluster()
.
pav_cluster(my_pav, mult = .3, sample_name_size = 2)
If you add phen_info
, the sample names and lines will be colored.
pav_cluster(my_pav, mult = .3, sample_name_size = 2, add_phen_info = "Region", phen_info_color_list = list( Region = structure( c("#F8766D", "#00BA38", "#619CFF"), names = unique(phen_info_data$Region))))
pav_pca()
pav_pca()
will perform PCA analysis of PAV data using prcomp()
. The center
, scale
, and rank
will pass to prcomp()
.
pav_pca(my_pav)
If you add phen_info
, the sample points will be colored.
pav_pca(my_pav, add_phen_info = "Region")
sim_stat()
Simulation can be used to predict the size of the genome. The function sim_stat()
is used to generate simulation results. It supports three genome types: pan-genome, core genome, and private genome.
The simulation result is the input data for function sim_plot
to visualize. If parallel
is TRUE, package snowfall
will be used for parallel computing to save time.
my_sim <- sim_stat(my_pav, genome_type = c("pan", "core"))
sim_plot()
The sim_plot
offers four chart types: "box", "jitter", "ribbon" and "errorbar".
sim_plot(my_sim) sim_plot(my_sim, chart_type = "jitter") sim_plot(my_sim, chart_type = "ribbon") sim_plot(my_sim, chart_type = "errorbar")
sim_multi_stat
and sim_multi_plot
These two functions are similar to sim_stat()
and sim_plot()
. But they are used for grouping simulation.
my_sim_2 <- sim_multi_stat(my_pav, "Region") sim_multi_plot(my_sim_2)
phen_stat()
Phenotype association can help researchers to understand the potential biological functions.
For discrete values, the fisher's exact test (fisher.test()
) will be used to determine whether the distribution of each gene is uniform. For continuous values, Wilcoxon tests(wilcox.test()
) will be performed.
my_phen <- phen_stat(my_pav, c("Genetic_sex", "DNA_source", "Region", "Coverage_mean"), p_adjust_method = "fdr")
phen_heatmap()
The phen_heatmap()
showing the main result of phenotype association analysis using a heat map. It requires the PAV object and the result from phen_stat()
. The rows and columns represent genes and phenotypes. You can filp coordinates using flip
. If adjust_p
is TRUE, the adjusted p is used, otherwise use p-value. Genes with at least one p_value/p_adjust under p_threshold
will be displayed. The color mapping of p_value/p_adjust is generated by setting cov_colors
.
phen_heatmap(my_pav, my_phen, cell_border_color = "white", na_col = "gray90", flip = T, adjust_p = F, p_threshold = 0.1, column_names_size = 7)
By default, only the significant cell be mapping colored, the color of other cells is na_col
. You can cancel it by setting only_show_significant
.
phen_heatmap(my_pav, my_phen, cell_border_color = "white", na_col = "gray90", flip = T, adjust_p = F, p_threshold = 0.1, column_names_size = 7, only_show_significant = FALSE)
The gene_info
can be added to the figure.
phen_heatmap(my_pav, my_phen, cell_border_color = "white", na_col = "gray90", flip = T, adjust_p = T, p_threshold = 0.1, column_names_size = 7, add_gene_info = c("chr", "legnth"), anno_param_gene = list(name_rot = 0))
phen_manhattan
If you want to study a phenotype further, you can plot a manhattan plot using phen_manhattan()
. But it required chromosomes and position in the gene_info
of PAV
object. The p_value and p_ajusted can be chosen by adjsut_p
. The most significant genes will be highlighted and labeled.
phen_manhattan(my_pav, my_phen, "Region", "chr_n", "start", highlight_top_n = 10, highlight_text_size = 2, x_text_size = 5)
phen_block
If the phenotype is discrete values, the phen_block
can be used to observe the presence/absence genes in every group. The number in brackets represents the sample size.
phen_block(my_pav, my_phen, "Region", adjust_p = F, p_threshold = .1, row_names_size = 6, cell_border_color = "black")
By default, the p_value/p_adjusted will be showed in annotation. If you want to view other information in gene_info
, you can add them using 'add_gene_info'.
phen_block(my_pav, my_phen, "Region", adjust_p = F, p_threshold = .1, row_names_size = 6, cell_border_color = "black", add_gene_info = c("p", "chr"))
phen_bar
and phen_violin
These two functions focus on showing the relationship between a certain gene and a certain phenotype. The phen_bar()
designed for discrete values and the phen_violin()
designed for continuous values.
phen_bar(my_pav, "Region", sample(my_pav@gene$name[my_pav@gene$type == "distributed"], 1)) phen_violin(my_pav, "Coverage_mean", sample(my_pav@gene$name[my_pav@gene$type == "distributed"], 1))
Putting together multiple result graphs can be easy to observe.
do.call(gridExtra::grid.arrange, lapply(sample(my_pav@gene$name,10), function(x){ phen_bar(my_pav, "Region", x) })) do.call(gridExtra::grid.arrange, lapply(sample(my_pav@gene$name,10), function(x){ phen_violin(my_pav, "Coverage_mean", x) }))
Mallick S, Li H, Lipson M, et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature. 2016;538(7624):201-206. doi:10.1038/nature18964
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.