knitr::opts_chunk$set( dev = "png", #dev.args = list(type = "cairo-png"), collapse = TRUE, comment = "#>" )
This package was originally designed for the Post-analysis of CNV results inferred from PennCNV and CNVPartition (GenomeStudio). However, it has now been expanded to accept input files in standard formats for a wider range of applications. Our motivation is to provide a standard, reproducible and time-saving pipeline for the post-analysis of CNVs and ROHs detected from SNP genotyping data for the majority of diploid Species. The functions provided in this package can be categorised into five sections: Conversion, Summary, Annotation, Comparison and Visualization. The most useful features provided are: integrating summarized results, generating lists of CNVR, annotating the results with known gene positions, plotting CNVR distribution maps, and producing customised visualisations of CNVs and ROHs with gene and other related information on one plot. This package also supports a range of customisations, including the colour, size of high resolution figures, and output folder, avoiding conflict between the results of different runs. Running through all functions detailed in the vignette could help us to identify and explore the most interesting genomic regions more easily. In the following sections, we will present how to use the functions provided in this package to solve these problems.
First, to run this package, we need to make sure that R (Version >= 3.5.2) is installed in your computer (R download link: https://www.r-project.org/). Once R is installed, the 'HandyCNV' package can be installed from Github repository by running the following script. If you rarely used R, it may take more time to install the 'HandyCNV' for the first time.
#install.packages("remotes") #library(remotes) #install_github(repo = "JH-Zhou/HandyCNV")
Then, we need to load the 'HandyCNV' package in order to run the following examples. This can be done using the library
function as shown below.
library(HandyCNV)
To start playing with this package, we first need to prepare at least one CNV result list. With only CNV results as an input file, we can explore the functions from section 1 to section 10 as below. But to get more interesting and potentially valuable results, we will have to prepare additional input files, including the map file and signal intensity file for the SNP chip used to generate the CNV list, the pedigree of the samples, and Plink format (bed, bim and fam) genotype files. Many of these files are already required during CNV discovery, and by generating them early in the pipeline, you will find that the rest of workflow will be much easier and faster. We will be introducing how to prepare each of the input files in the relevant chunks below.
We have provided some internal demo data, which should installed with package, in order to demonstrate how to use this package. We can access the demo data using the built-in R function system.file
(see examples below). In the demo data, the first CNV results file is the default output from PennCNV with the ARS-UCD 1.2 (ARS) cattle reference genome. The second CNV results file contains the default output from CNVPartition with UMD 3.1 (UMD) cattle reference genome, and the third CNV list is an example of a standard input file that could be prepared from CNV results inferred by other tools.
Note: If you prefer not to test with this demo data, just skip this step to section 1 and use your own data instead.
cnv_penn_ars <- system.file("extdata", "Demo_data/cnv_results/ARS_PennCNV_WGC.goodcnv", package = "HandyCNV") cnv_part_umd <- system.file("extdata", "Demo_data/cnv_results/UMD_CNVPartition.txt", package = "HandyCNV") cnv_other_umd <- system.file("extdata", "Demo_data/cnv_results/UMD_Standard_CNV.txt", package = "HandyCNV")
To start the analysis, let's first set up the working directory. This will help ensure that all results will be saved in the same directory.
setwd(dir = "C:/Users/handy_test") #remember to replace the path with your own
As above we mentioned, the functions in the package have been categorised into five sections: Conversion, Summary, Annotation, Comparison and Visualization.
Functions in the Conversion section can, for example, be used to convert coordinates between the default and target mapfiles provided by users; this would be useful if we are detecting CNVs with two different reference genomes. The converted map files can be output in formats suitable for use in Plink, PennCNV and our package 'HandyCNV'. It is also possible to convert coordinates for intervals (CNV, CNVR, QTL et al) between the defaults and target maps, which might useful when comparing the results to those of other researchers. Functions included in the Summarization section allow the formatting and plotting of outputs, such as producing detailed summaries of CNVs and ROHs, making CNV summary plots, generating CNV regions (CNVRs), and plotting CNVR distribution maps.
The Annotation section allows the downloading of reference gene lists from the UCSC website, enabling the annotation of CNV, CNVR, ROH or other intervals with gene locations. The Comparison section facilitates the comparison of CNVs, CNVRs, Gene Frequent Lists and any other intervals. The highlight of this section is that comparison of CNVs and CNVRs will produce reports detailing comparison results on both individual and population levels, and all comparison results are relative to both input files. Finally, the Visualization section contains functions to support customised CNV and ROH plotting by chromosome, specific sample, regions of interest, or target genes (plotting by gene is not yet available for ROH). It is also possible to annotate the plots with other information, such as gene locations, log R ratio, B Allele Frequency, SNP genotype, LD or the source of CNVs. These functions can also plot genes separately from reference gene lists, which might useful for comparing outputs with plots from other studies.
Now, let's start exploring the functions in HandyCNV
.
HandyCNV
and get a quick summary?We can use cnv_clean
function to convert cnv results to standard format. This function will also produce a brief summary of CNV results. First, let's format the sample CNV results from PennCNV. Note:
a. When formatting results in PennCNV format, the corresponding CNV input argument is 'penncnv'.
b. Default PennCNV results in Sample ID columns usually have the file path ahead the Sample ID. For example with the file "/home/zjhang027/C/final_403_penncnv/201094560082_R03C01.txt", the pattern 'cnv/' before the sample ID is unique, so we can take it as the separator using the argument penn_id_sep = "cnv/"
to extract the wanted sample ID.
cnv_penn_ars <- cnv_clean(penncnv = cnv_penn_ars, #replace input file when running your own data penn_id_sep = "cnv/", #take the unique pattern as a separator to extract the sample ID drop_length = 5, # the maximum CNV length threshold (in Mb); CNVs larger than this value will be deleted folder = "cnv_clean") #create a new folder to save results
Next, let's format a second set of CNV results, this time produced using CNVPartition. Note: The corresponding input argument is cnvpartition
for CNV result from CNVPartition.
cnv_part_umd <- cnv_clean(cnvpartition = cnv_part_umd, #when running your own data, set this argument by 'cnvpartition = "localpath/file"' drop_length = 5, #the maximum CNV length threshold (in Mb); CNVs larger than this value will be deleted folder = "cnv_clean")
If the CNV results were not produced by PennCNV or CNVPartition, we need to arrange the CNV results into the standard format, then use the standard_cnv
argument instead. The standard CNV input list should include the following columns (in order): Sample_ID
, Chr
, Start
, End
and CNV_Value
.
library(knitr) kable(data.frame(Sample_ID = c("201977910077_R05C01", "201094560084_R06C02", "201094560084_R06C02", "201094560067_R08C02", "201094560067_R08C02"), Chr = c(6, 7, 7, 4, 4), Start = c(9087392, 28583247, 37214745, 6248795, 66567441), End = c(9250902, 28621254, 37238472, 6425277, 66830563), CNV_Value = c(1, 3, 1, 3, 3)))
Let's now run cnv_clean
on the third Standard CNV input file. Note that we are using the argument standard_cnv
for the third CNV result.
cnv_standard <- cnv_clean(standard_cnv = cnv_other_umd, #when running your own data, set this argument by 'standard_cnv = "localpath/file"' drop_length = 5, #the maximum CNV length threshold (in Mb); CNVs larger than this value will be deleted folder = "cnv_clean")
The cnv_visual
function supports visualising CNVs by chromosome, single sample, regions of interest, and target genes.
Once we got the standard clean_cnv
result, we can use the cnv_visual
function to plot all CNVs, to have a quick look if there are any notable patterns in CNV results.
First, let's visualise the whole PennCNV dataset across all chromosomes:
cnv_visual(clean_cnv = cnv_penn_ars, #standard file was generated by 'cnv_clean' function in section 1 max_chr = 29, #select how many chromosomes to plot width_1 = 20, #optional,adjust the width of final plot height_1 =12, #optional,adjust the height of final plot folder = "cnv_visual_all_penn")
{width=320px}
Let's now visualise CNVs from the CNVPartition results by chromosome. We can adjust the size of final figure by setting 'width_1' and 'height_1'.
cnv_visual(clean_cnv = cnv_part_umd, #generated by 'cnv_clean' function in section 1 max_chr = 29, #select how many Chromosomes to plot width_1 = 20, #optional,adjust the width of final plot height_1 = 12, #optional,adjust the height of final plot folder = "cnv_visual_all_part", )
{width=320px}
After plotting all CNVs on each chromosome, we might find that some regions are particularly interesting. To plot these, we can add the options chr_id
, start_position
and end_position
to the cnv_visual
function call. Here, we are taking the region of Chr15:77-80.1221 Mb from the PennCNV results as an example.
cnv_visual(clean_cnv = cnv_penn_ars, #standard file was generated by 'cnv_clean' function in section 1 chr_id = 15, #set Chromosome start_position = 77, #set start position, unit is 'Mb' end_position = 80.1221, #set end position, unit is 'Mb' col_0 = "red", #optional, customize colour for 0 copy CNVs col_1 = "orange", #optional, customize colour for 1 copy CNVs col_3 = "blue", #optional, customize colour for 3 copy CNVs col_4 = "green", #optional, customize colour for 4 copy CNVs width_1 = 13, #optional, adjust the width of final plot height_1 = 10, #optional, adjust the height of final plot folder = "cnv_visual")
{width=320px}
Here we need to assign both the Sample ID (to individual_id
) and the clean CNV list (to clean_cnv
) in the cnv_visual
function. The individual CNV plot will saved in the current working directory. For example, the sample "201094560076_R03C02" has the most CNVs in the PennCNV results, so let's plot it and have a look. Note: we can use the individual_cnv_count.txt
file (to be generated in section 3) to identify samples that exhibit large numbers of CNVs.
cnv_visual(clean_cnv = cnv_penn_ars, #standard file was generated by 'cnv_clean' function in section 1 individual_id = "201094560076_R03C02", #set the sample ID width_1 = 20, #optional, adjust the width of final plot height_1 = 13, #optional. adjust the height of final plot folder = "cnv_visual")
{width=320px}
The idea of plotting CNVs by target gene is to avoid making overly-strong statements when reporting important candidate genes after annotation for CNVRs. If we think that one particular gene is especially important in the CNV analysis, it's better to present the frequency at which that gene has a structural variation in the experimental population. Plotting CNVs by gene will directly present how many samples show copy number variations overlapping the gene.
Note: The input file 'cnv_annotation.txt' will be generated in section 5.2, as part of annotating the gene to CNV results.
cnv_visual(clean_cnv = "call_gene_Penn_UCSC/cnv_annotation.txt", #here we need the annotated CNV list, generated by 'call_gene' in section 5.2 target_gene = "BLA-DQB", #set the gene name we are interested in col_0 = "darkorchid4", #optional, customize colour for o copy CNV col_1 = "dodgerblue3", #optional, customize colour for 1 copy CNV col_3 = "violetred3", #optional, customize colour for 3 copy CNV col_4 = "orangered3", #optional, customize colour for 4 copy CNV width_1 = 13, #adjust the width of final plot height_1 = 10, #adjust the width of final plot folder = "cnv_visual")
{width=320px}
To do this, we need to provide two input files to the cnv_visual
function: the CNV list in clean_cnv
and gene annotations in refgene
. We should also set the genomic region we are interested in. The reference gene list will be generated by the get_refgene
function in section 5.
cnv_visual(clean_cnv = cnv_penn_ars, #standard file was generated by 'cnv_clean' function in section 1 refgene = "refgene/Cow_ARS_UCSC.txt", #standard input file will generated by the function 'get_refgene' in Section 5 chr_id = 23, #set Chromosome start_position = 25.56, #set start position, unit is 'Mb' end_position = 25.74, #set end position, unit is 'Mb' gene_font_size = 2.2, #optional, adjust the size of gene font show_name = c(25.66,27.72), #optional, show the gene in the given interval, maximum support two intervals col_0 = "tomato", #optional, customize colour for 0 copy CNV col_1 = "hotpink", #optional, customize colour for 1 copy CNV col_3 = "turquoise", #optional, customize colour for 3 copy CNV col_4 = "aquamarine", #optional, customize colour for 4 copy CNV col_gene = "red", #optional, customize colour of gene width_1 = 14, #optional, adjust the width of final plot height_1 = 10, #optional, adjust the height of final plot folder = "cnv_visual")
{width=320px}
Using the clean CNV results, we can run the cnv_summary_plot
function to generate a range of summary plots. These plots can aggregate CNV results by length group, CNV type, chromosome, and individual. It is also possible to make two different types of plots combining these figures.
cnv_summary_plot(clean_cnv = cnv_penn_ars, #standard file was generated by 'cnv_clean' function in section 1 length_group = c(0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1, 2, 5), # optional, set group of vectors to divide CNV length, unit is Mb. such as vector of ‘c(0.05, 0.3, 1)’, means divide the CNV length into four group: '<0.05Mb', '0.05 - 0.3Mb', '0.3-1Mb' and '>1Mb', maximum can accept 11 values col_0 = "red", #optional, customize color for o copy CNV col_1 = "black", #optional, customize color for o copy CNV col_3 = "purple", #optional, customize color for o copy CNV col_4 = "blue", #optional, customize color for o copy CNV plot_sum_1 = TRUE, #optional, make sum combination plot 1 height_sum1 = 26, #optional, adjust the size of sum plot 1 width_sum1 = 20, #optional, adjust the size of sum plot 1 plot_sum_2 = TRUE, #optional, make sum combination plot 2 height_sum2 = 20, #optional, adjust the size of sum plot 2 width_sum2 = 27, #optional, adjust the size of sum plot 2 folder = "cnv_summary_plot_penn")
{width=320px}
{width=320px}
CNV regions are defined as regions of the genome containing one or more CNVs that overlap by at least one base pair: s a result, CNVRs will never overlap one another. As an example, let's first generate CNVRs for the clean PennCNV results, using the call_cnvr
function. A key parameter of this function is chr_set
, which allows setting the maximum number of chromosomes over which to generate CNVRs. In our example, the species is Bos taurus (cows), with 29 autosomes, so we set chr_set = 29
. This function will output three tables:
individual_cnv_cnvr.txt
.The resulting CNVRs are sorted by chromosome, start position, and end position, then assigned the CNVR ID to each unique interval by order. One of the output files, individual_cnv_cnvr.txt
contain both CNVR and CNV information for every sample: this file might suitable for CNV-GWAS analyses.
cnvr_penn_ars <- call_cnvr(clean_cnv = cnv_penn_ars, #standard file was generated by 'cnv_clean' function in section 1 chr_set = 29, #Set the maximum number of chromosomes (29 autosomes for cattle) folder = "call_cnvr_penn")
Next, let's generate CNVRs for CNVPartition and standard CNV lists. In order to avoid mixing the results among different files, we can set the folder
argument to assign a distinguishing name to the newly generated folder.
call_cnvr(clean_cnv = cnv_part_umd, #standard file was generated by 'cnv_clean' function in section 1 chr_set = '29', folder = "call_cnvr_part") call_cnvr(clean_cnv = cnv_standard, #standard file was generated by 'cnv_clean' function in section 1 chr_set = 29, folder = "call_cnvr_standard")
Once we have defined some CNV regions, we might want to see how many genes are located each. We can annotate genes to both CNVs and CNVRs using the call_gene
function. Before that, however, we need to prepare gene lists for the correct reference genome. This can be done using the get_refgene
function, as described in the following sections.
We can use the get_refgene
function to automatically download the reference gene list and convert it to the standard format required for functions such as call_gene
. First, let's run get_refgene
without arguments to see which reference gene lists we can download. If you don't find the information you're looking for, please feel free to contact our maintainer to update it to you. For now, only reference gene lists from the UCSC website are supported; however, we may add more features if there is sufficient demand.
get_refgene()
Now, let's get the reference gene list for the PennCNV results. This dataset was generated using ARS-UCD 1.2 as reference genome. Two gene lists are available for this reference: cow_ARS_UCSC
and cow_ARS_ENS
. The suffix ENS
indicates the Ensembl reference gene list. The gene list will be saved in the folder refgene
, which will be created first if necessary.
Cow_ARS_UCSC <- get_refgene(gene_version = "Cow_ARS_UCSC") Cow_ARS_ENS <- get_refgene(gene_version = "Cow_ARS_ENS")
Next let's get reference gene list for the CNVPartition and Standard result sets, both of which were generated on the UMD 3.1 reference genome; in this case, Cow_UMD_UCSC
is the correct option to select.
COW_UMD_UCSC <- get_refgene(gene_version = "Cow_UMD_UCSC")
Now we have well-prepared reference gene lists, let's return to the call_gene
function.
call_gene
functionThe call_gene
function has three arguments:
ref_gene
: this argument is required, and takes the path of a downloaded gene set (in the refgene
folder).interval
: an optional file containing a list of CNV regions. This argument can also point to any file storing interval data in the correct format; examples could include CRVs, QTL windows, or haplotype lists.clean_cnv
: an optional file containing a list of CNVsWe suggest that both 'interval' and 'clean_cnv' should be set simultaneously: this will enable reporting of the frequency of all annotated genes by counting how many CNVs overlapped to each gene within a CNVR. In section 10, we will show how to use the output file gene_freq_cnv.txt
to make comparisons between different CNV results to find sets of consensus genes.
As an example, we will first call genes for the PennCNV results, ensuring that the gene list for the correct reference has been set (ARS 1.2). In this case, we are using the Ensembl-provided gene list.
call_gene(refgene = "refgene/Cow_ARS_ENS.txt", #standard file was generated by 'get_refgene' function in section 5.1 interval = cnvr_penn_ars, #optional, standard file was generated by 'call_cnvr' function in section 4 clean_cnv = cnv_penn_ars, #optional, standard file was generated by 'cnv_clean' function in section 1 folder = "call_gene_Penn_Ens")
We can also use the UCSC gene list:
call_gene(refgene = "refgene/cow_ARS_UCSC.txt", #standard file was generated by 'get_refgene' function in section 5.1 interval = "call_cnvr_Penn/cnvr.txt", #standard file was generated by 'call_cnvr' function in section 4 clean_cnv = "cnv_clean/penncnv_clean.cnv", #optional, standard file was generated by 'cnv_clean' function in section 1 folder = "call_gene_Penn_UCSC")
Comparing the results, we observe that the results from two reference gene lists are slightly different, likely because the maintainers of each database have different methodologies (and possibly data sources) for producing their gene annotations. As a result of this, the position and quantity of genes might slightly differ between gene builds even for the same reference genome. Therefore, it may be important to validate genes of interest in more than one database.
Now, let's call genes for the CNVPartition results. As before, we provide both CNVR and CNV lists, and set the new folder name by arguments. Note that we need to use the UMD 3.1 reference build for this data set. We also run the standard CNV set (using the ARS1.2 reference).
call_gene(refgene = "refgene/cow_UMD_UCSC.txt", #standard file was generated by 'get_refgene' function in section 5.1 interval = "call_cnvr_Part/cnvr.txt", #standard file was generated by 'call_cnvr' function in section 4 clean_cnv = "cnv_clean/cnvpart_clean.cnv", #optional, standard file was generated by 'cnv_clean' function in section 1 folder = "call_gene_Part") call_gene(refgene = "refgene/cow_ARS_UCSC.txt", #standard file was generated by 'get_refgene' function in section 5.1 interval = "call_cnvr_standard/cnvr.txt", #standard file was generated by 'call_cnvr' function in section 4 clean_cnv = "cnv_clean/cleancnv.cnv", #optional, standard file was generated by 'cnv_clean' function in section 1 folder = "call_gene_standard_ARS_UCSC")
The function cnvr_plot
is used to plot the CNVR distribution across chromosomes. For this function, we only need to assign the a CNVR list file to the cnvr
argument, and the reference genome assembly to the assembly
argument (to provide chromosome lengths). For Bovine data, we provide data for both ARS and UMD. For other species with data available (see the ref_gene
function in section 5.1), we can just assign the species name to the assembly
argument, for example, assembly = "hg38"
for human data.
First, we'll plot the CNVR distribution for the PennCNV data set. The default legend position might not suitable for all species; the arguments legend_x
and legend_y
can be used to move it if necessary.
cnvr_plot(cnvr = "call_cnvr_Penn/cnvr.txt", #standard file was generated by 'call_cnvr' function in section 4 assembly = "ARS", #cattle could select 'UMD' or 'ARS', other species could assign the name of species loss_col = "green2", #optional, adjust the color of Loss type of CNVR gain_col = "red", #optional, adjust the color of Gain type of CNVR mixed_col = "black", #optional, adjust the color of Mixed type of CNVR legend_x = 127, #optional, adjust the horizontal position of legend legend_y = 30, #optional, adjust the vertical position of legend #height_1 = 14, #optional, adjust the height of CNVR plot #width_1 = 10, #optional, adjust the width of CNVR plot folder = "cnvr_plot_penn")
{width=320px}
Next, we plot the CNVRs derived from the CNVPartition data set.
cnvr_plot(cnvr = "call_cnvr_Part/cnvr.txt", #standard file was generated by 'call_cnvr' function in section 4 assembly = "UMD", #cattle could select 'UMD' or 'ARS', other species could assign the name of species loss_col = "darkorchid", #optional, adjust the color of Loss type of CNVR gain_col = "gold", #optional, adjust the color of Gain type of CNVR mixed_col = "tomato", #optional, adjust the color of Mixed type of CNVR legend_x = 127, #optional, adjust the horizontal position of legend legend_y = 30, #optional, adjust the vertical position of legend height_1 = 14, #optional, adjust the height of CNVR plot width_1 = 17, #optional, adjust the width of CNVR plot folder = "cnvr_plot_part")
{width=320px}
It always interesting to see which genomic regions have the highest frequencies of structure variation. To view these, we can plot all the high frequency regions using the cnvr_plot
function.
As an example, let's plot the PennCNV results. Remember to set a new folder name to save the results.
cnvr_plot(cnvr = "call_cnvr_Penn/cnvr.txt", #standard file was generated by 'call_cnvr' function in section 4 clean_cnv = "cnv_clean/penncnv_clean.cnv", #standard file was generated by 'cnv_clean' function in section 1 refgene = "refgene/cow_ARS_ENS.txt", #standard file was generated by 'get_refgene' function in section 5.1 sample_size = 382, #total sample size common_cnv_threshold = 0.18, #the CNV frequency threshold width_1 = 10, #optional, adjust the width of final plot height_1 = 10, #optional, adjust the height of final plot col_gene = "blue", #optional, adjust the colour of annotated genes gene_font_size = 2.5, #optional, adjust the font size of annotated gene names folder = "high_freq_cnvr_penn")
{width=320px}
Typically, we will use multiple models and software packages when predicting and analysing CNVs, especially when using data based on SNP chip genotypes. As a result, it is very common that we will need to compare two or more sets of predicted CNVs. To tackle this problem, we have implemented the compare_cnv
function to make comparisons on both the individual and population levels. At the individual level, the function checks how many CNVs overlap in the same individual between two CNV result sets, thereby verifying the level of similarity between the two sets of results. On the population level, the comparison will check the level of overlap through the whole sample population, without checking whether or not the CNVs are present in the same individual; this can better indiciate the degree of repeatability of CNV detection between the two result sets.
As an example let's compare the CNV prediction results for PennCNV and CNVPartition. Note that these results were generated using different reference builds, so the results of the comparison are currently rather meaningless (the physical coordinates of the CNVs will not line up across reference builds): this problem will be fixed in section 11 below.
compare_cnv(cnv_def = cnv_part_umd, #standard file was generated by 'cnv_clean' function in section 1 cnv_tar = cnv_penn_ars, #standard file was generated by 'cnv_clean' function in section 1 width_1 = 12, #optional, adjust the width of final plot height_1 = 11, #optional, adjust the height of final plot legend_x = 0.9, #optional, adjust the horizontal position of final plot legend_y = 0.9, #optional, adjust the vertical position of final plot col_1 = "pink", #optional, adjust the colour of Non-Overlapped part col_2 = "lightblue", #optional, adjust the colour of Overlapped part plot_caption = TRUE, #optional, show the Note information in plot or not folder = "compare_cnv_PartUMD_Vs_PennUMD")
{width=330px}
Similar to the compare_cnv
function, we can use the compare_cnvr
function to compare CNV regions. The same caveats apply with regard to the ensuring the two CNVR data sets were built on the same reference genome. Let's try to use compare_cnvr
now with the CNVPartition and Standard data sets (as before, ignoring for now the fact that these were produced on different reference genomes).
compare_cnvr(cnvr_def = "call_cnvr_Part/cnvr.txt", #standard file was generated by 'call_cnvr' function in section 4 cnvr_tar = "call_cnvr_standard/cnvr.txt", #standard file was generated by 'call_cnvr' function in section 4 width_1 = 8, #optional, adjust the width of comparison plot height_1 = 8, #optional, adjust the height of comparison plot hjust_prop = 0.2, #optional, adjust the horizontal position of the proportion number present in plot hjust_num = 1.6, #optional, adjust the horizontal position of the number present in plot col_1 = "grey15", #optional, adjust the colour of non-overlapped part col_2 = "forestgreen", #optional, adjust the colour of overlapped part folder = "compare_cnvr_PartUMD_Penn_UMD")
{width=320px}
One of the most interesting parts of CNV analysis is to find out which genes are affected by structure variations, and to check out the frequency of these aberrant genes in our samples. After using the call_gene
function in section 5.2, we now have the gene frequency list for each CNV result set. We can take these lists as input in the compare_gene
function to find the set of consensus genes common to multiple CNV results.
Now, let's compare the annotated gene lists for the PennCNV and CNVPartition results as an example. We can assign the arguments gene_freq_1
and gene_freq_2
with the gene_freq_cnv.txt
files from the call_gene_Part
and call_gene_Penn
output folders created earlier. Then we need to set an integer threshold (common_gene_threshold
) for the number of observations required to declare that a gene is "commom". In our example, the sample size is 400, so for a prevalence of 5%, the common_gene_threshold
could calculated as 400 * 5% = 20
.
compare_gene(gene_freq_1 = "call_gene_Part/gene_freq_cnv.txt", #standard file was generated by 'call_gene' function in Section 5.2 gene_freq_2 = "call_gene_Penn_Ens/gene_freq_cnv.txt", #standard file was generated by 'call_gene' function in Section 5.2 title_1 = "Bovine Part", #optional, adjust the title of input file 1 present in final plot title_2 = "Bovine Penn", #optional, adjust the title of input file 2 present in final plot common_gene_threshold = 20, #optional, the threshold is the gene frequency, here 20 means the Gene frequency exceed 20 in two input lists will be regarded as common gene. col_1 = "red", #optional, set the colour of 'Common High' genes col_2 = "blue", #optional, set the colour of 'Common Low' genes col_3 = "gold", #optional, set the colour of 'gene_freq_1'-only genes col_4 = "green", #optional, set the colour of 'gene_freq_2'-only genes height_1 = 10, #optional, adjust the height of final plot width_1 = 12, #optional, adjust the width of final folder = "compare_gene_Part_Vs_Penn")
{width=300px}
We can also compare the frequency lists for three genes, just by adding one more input file (using the gene_freq_3
argument); in this case, the comparison figure will be a 3D plot.
compare_gene(gene_freq_1 = "call_gene_Part/gene_freq_cnv.txt", #standard file was generated by 'call_gene' function in Section 5.2 gene_freq_2 = "call_gene_Penn_Ens/gene_freq_cnv.txt", #standard file was generated by 'call_gene' function in Section 5.2 gene_freq_3 = "call_gene_standard_ARS_UCSC/gene_freq_cnv.txt", #standard file was generated by 'call_gene' function in Section 5.2 title_1 = "Part", #optional, adjust the title of input file 1 present in final plot title_2 = "ENS", #optional, adjust the title of input file 2 present in final plot title_3 = "UCSC", #optional, adjust the title of input file 3 present in final plot common_gene_threshold = 15, height_1 = 5.8, #optional, adjust the height of final plot width_1 = 5.8, #optional, adjust the width of final plot folder = "compare_gene_three_lists")
{width=320px}
If we add a fourth gene frequency lists, by adding the gene_freq_4
argument, the comparison figure produced will be a heat-map rather than a scatterplot.
compare_gene(gene_freq_1 = "call_gene_Part/gene_freq_cnv.txt", #standard file was generated by 'call_gene' function in Section 5.2 gene_freq_2 = "call_gene_Penn_UCSC/gene_freq_cnv.txt", #standard file was generated by 'call_gene' function in Section 5.2 gene_freq_3 = "call_gene_Part/gene_freq_cnv.txt", #standard file was generated by 'call_gene' function in Section 5.2 gene_freq_4 = "call_gene_standard_ARS_UCSC/gene_freq_cnv.txt", #standard file was generated by 'call_gene' function in Section 5.2 title_1 = "Part", #optional, adjust the title of input file 1 present in final plot title_2 = "ENS", #optional, adjust the title of input file 2 present in final plot title_3 = "UCSC", #optional, adjust the title of input file 3 present in final plot title_4 = "PUCSC", #optional, adjust the title of input file 4 present in final plot col_1 = "red", #optional, set the color of High frequency genes col_2 = "yellow", #optional, set the color of Low frequency genes common_gene_threshold = 20, height_1 = 9, #optional, adjust the height of final plot width_1 = 11, #optional, adjust the width of final plot folder = "compare_gene_four_lists")
{width=300}
The ten functions we've run so far only require users to provide the CNV results, which can already generate plenty of summary results that include sufficient information to point us in the right direction to dig deeper. If we want to explore a particular region in more depth, there are extra functions in the HandyCNV
package that can combine CNV data with other data sources. We will introduce some of these functions in the following section.
HandyCNV
, to allow comparing CNVs between different reference genomes?We can use the convert_map
function to prepare a map file; generally, we will need one for each reference file. In our examples, we have used the two bovine reference genomes ARS-UCD 1.2 and UMD 3.1. For HandyCNV
, map files hould have the format shown in the table below, with the four columns Chromosome
, SNP
, MorganPos
, and PhysicalPos
. If the centiMorgan position (MorganPos
) is unknown, we can assign the missing value as 0
, or provide the phyisical position in megabases. This format is the same as the standard .map
format used in Plink and other software, and can be exported from GenomeStudio. The column names are not required.
tabl <- " ----------------------------------------- Chr SNP MorganPos PhysicalPos ---- ------------- ---------- ----------- 5 14636 56.3275 56327514 5 16084 56.329 56328962 5 19597 56.3325 56332475 1 1_115292065 114.371 114370731 1 1_115292107 114.371 114370773 -----------------------------------------"
Here, we have the map files for both bovine references included in the package, at the following paths:
ars_map <- system.file("extdata", "Demo_data/map/ARS1.2_GGPHDV3.map", package = "HandyCNV") umd_map <- system.file("extdata", "Demo_data/map/UMD3.1_Bovine.map", package = "HandyCNV")
The function will not only generate map files, for use in compare_cnv
and compare_cnvr
, it will also generate map file used in Plink and PennCNV for both inputs. The conversion process also compares the two input maps and produces summary tables and figures.
convert_map(default_map = umd_map, #when running your own data, set this argument to 'default_map = "localpath/file"' target_map = ars_map, #when running your own data, set this argument to 'target_map = "localpath/file"' chr_set = 30, defMap_title = "UMD 3.1", #set the label of axis tarMap_title = "ARS-UCD 1.2", #set the label of axis col_1 = "green4", #set colour for the type of Match in SNP comparison plot col_2 = "red", #set colour for the type of Unmatched in SNP comparison plot col_3 = "deeppink2", #set colour for the bar of Target_Map in SNP density plot col_4 = "deeppink2", #set colour for line of Target_map in density plot col_5 = "turquoise3", #set colour for point of Default_Map in SNP density plot col_6 = "turquoise3", #set colour for line of Default_map in SNP density plot species = "Bovine", height_c = 16, width_c = 22, height_d = 14, width_d = 25)
{width=320px}
{width=320px}
The plot_cnvr_panorama
function was built to achieve this task. But we need provide more information to this function:
a. CNVR results(generated by 'call_cnvr' in section 4)
b. Annotated CNV file (generated by 'call_gene' function)
c. Signal Intensity file (In general it export from GenomeStudio which used in CNV detection process)
d. Map file (Four columns used in Plink or generated by 'convert_map' function)
e. Genotype files (generated by Plink with '--make-bed' flag which include BED, BIM, FAM three files)
plot_cnvr_panorama(cnvr = "call_cnvr_Penn/cnvr.txt", #standard file was generated by 'call_cnvr' function in section 4 cnv_annotation = "call_gene_Penn_UCSC/cnv_annotation.txt", #standard file was generated by 'call_gene' function in Section 5.2 intensity = "Demo_data/demo_intensity.txt", map = "Demo_data/demo.map", prefix_bed = "Demo_data/demo_HandyCNV", ld_heat = TRUE, sample_size = 382, common_cnv_threshold = 0.18, col_0 = "green", #optional, customize color for o copy CNV col_1 = "hotpink", #optional, customize color for 1 copy CNV col_2 = "NA", #optional, customize color for 2 copy CNV col_3 = "turquoise", #optional, customize color for 3 copy CNV col_4 = "aquamarine", #optional, customize color for 4 copy CNV col_gene = "red", #optional, customize color of gene width_1 = 16, height_1 = 30, folder = "cnvr_panorama_heat_ld")
{width=150px}
For LD, we can replace the heatmap with the classic inverted triangle diagram by setting the argument ld_heat = FALSE
, which will use the gaston
package to plot LD. Because this requires converting the figure to ggplot
format (via the ggplotify
package), this process can be time consuming, and sometimes does not work well for large genomic regions. Therefore, we suggest that you use ld_heat = TRUE
initially, before narrowing in on a particular region of interest to display with the classic LD plot.
plot_cnvr_panorama(cnvr = "call_cnvr_Penn/cnvr.txt", #standard file was generated by 'call_cnvr' function in section 4 cnv_annotation = "call_gene_Penn_UCSC/cnv_annotation.txt", #standard file was generated by 'call_gene' function in Section 5.2 intensity = "Demo_data/demo_intensity.txt", map = "Demo_data/demo.map", prefix_bed = "Demo_data/demo_HandyCNV", sample_size = 382, common_cnv_threshold = 0.18, ld_heat = FALSE, col_0 = "green", #optional, customize colour for o copy CNV col_1 = "hotpink", #optional, customize colour for 1 copy CNV col_2 = "NA", #optional, customize colour for 2 copy CNV col_3 = "turquoise", #optional, customize colour for 3 copy CNV col_4 = "aquamarine", #optional, customize colour for 4 copy CNV col_gene = "red", #optional, customize colour of gene width_1 = 16, height_1 = 30, folder = "cnvr_panorama_classical_ld")
{width=150px}
CNVs are heritable: with the pedigree information, particularly for the sires in a livestock context, we can produce plots of CNV sources. The plot_cnvr_source
function can produce plots by sire, herd, and sire source. Sire source can be any freeform text, for example the country of origin. The expected input format is comma-separated (with a header), containing the following columns (in any order):
Chip_ID
(matched to the Sample_ID
column in the CNVR data)Herd
Sire_ID
Sire_Source
plot_cnvr_source(cnvr = "call_cnvr_Penn/cnvr.txt", #standard file was generated by 'call_cnvr' function in section 4 clean_cnv = "cnv_clean/penncnv_clean.cnv", #standard file was generated by 'cnv_clean' function in section 1 pedigree = system.file("extdata", "Demo_data/Pedigree/Pedigree.csv", package = "HandyCNV"), Frequent_threshold = 80, col_0 = "violet", col_1 = "turquoise", col_3 = "tomato", col_4 = "lightblue", report_id = T, height_1 = 20, width_1 = 15)
{width=150px}
The plot_gene
function can do this for a specific reference gene set and genomic region.
plot_gene(refgene = "refgene/cow_ARS_UCSC.txt", chr_id = 7, start = 11, end = 12, height_1 = 3, width_1 = 10, show_name = c(11.0, 11.35, 11.75, 12.00), gene_font_size = 2.5)
{width=300px}
An ROH (Run of homozygosity) is a region of an individual's genome where almost all variants have homozygous genotypes. These regions generally indicate the presence of inbreeding in that individual's ancestry, with more recent inbreeding leading to longer ROHs.
The function roh_window
will produce two summary tables, as well as plotting the high frequency regions by chromosome. The first table summarises the total numbers and lengths of ROHs for each individual; the second reports summary statistics for aggregated windows of ROHs across all individuals.
roh_window(roh = "cnv_clean/cnvpart_roh.cnv", window_size = 2, length_autosomal = 2489.386, threshold = 0.15, folder = "roh_window", col_higher = "red", col_lower = "black", height_1 = 15, width_1 = 11, legend_x = 0.92, legend_y = 0.06, ncol_1 = 5)
{width=300px}
Visualising ROH is similar to visualising CNVs. The exception is that, because length is particularly informative for ROHs (about the age of inbreeding), the ROHs are coloured by their lengths. The roh_visual
function can plot ROHs, either across the whole genome, or chromosome by chromosome. It also supports restricting plotting to regions of interest, and allows adding a panel of gene locations.
To plot a single chromosome, use the chr_id
argument.
roh_visual(clean_roh = "cnv_clean/cnvpart_roh.cnv", chr_id = 6, width_1 = 14, height_1 = 9, folder = "roh_visual", col_short = "red", col_mid = "blue", col_long = "black")
{width=300px}
To plot a single region of interest, add the start_position
and end_position
arguments (units in megabases).
roh_visual(clean_roh = "cnv_clean/cnvpart_roh.cnv", chr_id = 6, start_position = 36.5, end_position = 40.6, width_1 = 16)
{width=300px}
To add a panel of gene locations, add the refgene
argument, pointing to a file containing genes annotated on the appropriate reference genome.
roh_visual(clean_roh = "cnv_clean/cnvpart_roh.cnv", chr_id = 6, start_position = 36.5, end_position = 40.61, refgene = "refgene/cow_UMD_UCSC.txt")
{width=300px}
If the ROH region has too many genes, plotting all the gene names will make the figure look too busy. We can choose the most important region(s) to plot using the show_names
argument. In this example, we set two regions with the following argument: show_name = c(37.28, 37.49, 38.25, 39.52)
. This will show the gene names in two regions: (37.28 - 37.49 Mb) and (38.25 - 39.52 Mb).
roh_visual(clean_roh = "cnv_clean/cnvpart_roh.cnv", chr_id = 6, start_position = 36.5, end_position = 40.5, refgene = "refgene/cow_UMD_UCSC.txt", show_name = c(37.28, 37.49, 38.25, 39.52), col_short = "black", col_mid = "green", col_long = "red")
{width=300px}
Above, We have plotted ROH in an interested region, but that method can only plot short ROHs which located in the given region, it will remove some longer ROHs which are partially overlapped to the given interval, that method can made a nice picture but will lost some samples, it still useful because we already know the sample size in that region. But if we want to present the whole length of ROH overlapped to our interested region (Such as a region of target gene), we can assign a vector to 'target_region' argument in 'roh_visual' function.
roh_visual(clean_roh = "cnv_clean/cnvpart_roh.cnv", target_region = c(6,38.25,39.52), #within the vector, the first integer indicate Chromosome. the rest are the start and end point, unit in megabases col_short = "red", #optional col_mid = "gold", #optional col_long = "blue", #optional height_1 = 12, #optional width_1 = 14)
{width=300px}
When we have some interested individuals, we can plot all the ROHs for these samples separately by setting the 'individual_id' arguments as following:
roh_visual(clean_roh = "cnv_clean/cnvpart_roh.cnv", individual_id = "201094560008_R04C01", height_1 = 14, #optional width_1 = 22, #optional col_short = "green", #optional col_mid = "yellow", #optional col_long = "red")
{width=300px}
ROH results will return us some long regions, but we may interested on some specific genes in the ROH region that experienced strong selection. Get the haplotypes of these genes in our experimental population may help us to find out some interesting patterns. we known that the ROH region generated by SNPs may be discontinuous due to insufficient marker density, beside, when infer ROH we considered the influence of genotyping quality by setting the minimal window size, accepting some missing value or maximum heterozygosity SNPs in the window, these conditions will exclude the short ROH regions in the final results. For instance, when we get the haplotype for a short region (such as a Gene), the frequency of homozygosity haplotypes may differing to the original ROH region it located.
The 'get_haplotype' function can help us to get the haplotype for all samples by given region. But now the application is limited, it only accept the phased genotype by Beagle (Version 5.1) with physical position, we may expend the compatibility later if necessary. The standard format of phased genotype and demo data used at here can be download from our github repository (https://github.com/JH-Zhou/HandyCNV/tree/master/vignettes/orse_chr22_phased.vcf.gz). Now, let's go ahead and make haplotypes.
Extracting the haplotype will take three steps and combine three functions. The first step is to prepare the standard format for 'HandyCNV'. We can use 'prepare_geno' function to prepare the phased genotype. We can also assign the 'convert_letter = TRUE' in 'pre_phased' function to convert the format of haploids from 0,1 to letters. See code as following. After running, it will return three list files contain the map, genotype and the columns index file.
genotype_by_number <- prep_phased(phased_geno = "orse_chr22_phased.vcf.gz") #'orse_chr1_phased.vcf.gz' is the demo data, need to download and save it in working directory genotype_by_letter <- prep_phased(phased_geno = "orse_chr22_phased.vcf.gz", convert_letter = T) #'orse_chr1_phased.vcf.gz' is the demo data, need to download and save it in your working
Second step, we need to assign the interested region. Here we take the GDF5 gene as example, it located on Chr22:26,183,610-26,188,173 bp. The idea in this step is to find out the closest SNPs in this SNP chip.
region <- closer_snp(phased_input = genotype_by_letter$map, #the map file in the list generated in 'pre_phased' function above chr = 22, start = 43.771500, end = 43.856076) #physical position unit in 'megabases'
Third step to get haplotype. It will return two lists contains the reocding haplotype and the index of recode type and the original haploid.
haplotype_number <- get_haplotype(geno = genotype_by_number, pos = region) haplotype_letter <- get_haplotype(geno = genotype_by_letter, pos = region)
Now, we can count the frequency of each haplotype in our population and to check what the haploids are in these haplotypes.
data.frame(table(haplotype_number$hap_type$Recode_Type, dnn = "Haplotype")) haplotype_number$hap_index data.frame(table(haplotype_letter$hap_type$Recode_Type, dnn = "Haplotype")) haplotype_letter$hap_index
Then, visualize haplotype of BMP7 gene.
haplo_visual(haplotype = haplotype_letter, xlab_text = "BMP7 Gene ")
{width=300px}
The function convert_coord
can used to convert coordinates for lists of intervals. Currently, the function is limited to inputs generated by the convert_map
function, and can only convert the coordinates for intervals on the same type of SNP Chip. Converting coordinates may change the total length of the intervals, as of the positions and orders of the SNPs on the chromosome will potentially differ between various versions of reference genome; therefore, we should pay attention to the quality of the conversion, and use the results with caution. The function produces a summary table that summarises how many intervals were converted successfully, and reports on the differences in length between the converted and original intervals.
The proxy_range
argument supports adjusting the genomic range to search and assign "proxy" locations for markers which are missing in the target map, in order to try to best reduce the loss of intervals after conversion. The principal is try to reduce the length of the intervals that were assigned to the proxy locations. For missing markers at the start positions, the search will begin at the missing position and proceed downsteam, selecting the closest suitable marker as the proxy position. For the end position, the search will begin from the missing end interval and proceed upsteam, again assigning the closest suitable marker as the proxy position. If no suitable markers can be found within proxy_range
basepairs, the interval will be classed at "Missing".
The optional argument len_diff_threshold
is a quality control measure on the newly converted intervals that imposes a maximum difference of length between converted and original intervals. Only intervals with remapped positions within len_diff_threshold
basepairs of the original position will pass the threshold.
Here, we convert the CNVPartition CNVR results from UMD3.1 coordinates to ARS1.2. The outputs are saved into the convert_coord_penn_cnv
folder, and summary statistics about the conversion are displayed below.
convert_coord(input_def = "cnv_clean/cleancnv.cnv", def_tar_map = "convert_map/def_tar_map.map", proxy_range = 100000, len_diff_threshold = 100000, folder = "convert_coord_penn_cnv")
We can also convert coordinates from the target map to the default map. For example, let's convert the PennCNV ARS results to UMD 3.1. After converting the coordinates of CNVs or CNVRs, we could use the results which passed the quality check to make comparisons with the compare_cnv
and compare_cnvr
functions as section 8 and 9 described.
convert_coord(input_def = "call_cnvr_standard/cnvr.txt", proxy_range = 100000, len_diff_threshold = 100000, def_tar_map = "convert_map/def_tar_map.map", folder = "convert_coord_penn_cnvR")
Plotting SNP density may help us learn about the SNP distribution on chromosomes. The parameter of SNP density may affect ROH detection in some tools. 'plot_snp_density' only working with map file that has four columns in the order as follows: Chromosome, SNP ID, Position in Morgans, Physical Position. No header required.
plot_snp_density(map = "convert_map/target_plink.map", max_chr = 24, #optional top_density = 60, #optional low_density = 20, #optional color_top = "red", #optional color_low = "blue", #optional color_mid = "black", #optional legend_position = c(0.9, 0.1), #optional x_label = "Physical position\n物理位置", #optional y_label = "SNPs/Mb\n每1Mb区间的SNP数",#optional ncol_1 = 5) #save the plot by 'ggsave' #ggsave(filename = "snp_density.png", width = 26, height = 18, units = "cm", dpi = 350)
{width=300px}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.