knitr::opts_chunk$set(warning = FALSE, collapse = TRUE, comment = "#>", tidy = TRUE, out.width = "90%", fig.height=4, fig.width=7, out.extra='style="margin: auto; display: block; padding-top: 15px;"')
This web app and open source software package for the R programming language was built to facilitate exploration of the naturally occurring sequence diversity for your favorite gene or gene family. We hope this app will help plant biologists without bioinformatics experience access the 1001 genomes project’s rich catalog of natural variation, formulate hypotheses and identify new interesting alleles in their favorite gene family.
The 1001 genomes dataset was collected with the intention of genome scale analyses, such as genome wide association studies, in which a particular phenotype is quantified for all of the accessions and sequence variants that are associated with phenotypic variation are identified. In recent work, we have utilized this rich dataset in a different manner to measure the effects of coding sequence variation on protein functional variation and phenotypic variation. This work has expanded the sequence-function-phenotype map for the auxin-signaling F-box gene family, and this product will facilitate similar analysis other genes, gene families, and genomic regions.
So pick out your favorite gene, gene family or genomic region and follow the steps in this guide to analyze how it varies in plants collected from all over the world. For this example we will be using the auxin-signaling F-box genes, TIR1 and AFB2 which have the TAIR locus IDs AT3G62980 and AT3G26810. Click on one of the previous links to search TAIR for the TAIR IDs of your favorite genes.
Before digging into the data, make some hypotheses about what genes, domains and/or positions in your favorite gene family might be under different evolutionary pressures. We can make some inferences about the evolutionary pressure acting on particular genes based on the sequence and geographic diversity observed in the 1001 genomes dataset. Thomas Mitchell-Olds and Johanna Schmitt published an excellent overview of the evolutionary inference from natural genetic diversity in A. thaliana [@mitchell-olds_genetic_2006]. We can calculate a measure of genetic diversity as the average number of nucleotide differences per site between two DNA sequences in all possible pairs in the sample population, and is denoted by $\pi$.
Evolutionary pressures potentially exert influence across all levels of biology, nucleotides, genes, genomes, phenotypes, organisms and populations. This makes any inference from study of genetic diversity contingent on potentially unknown effects of epistasis between genes, population structure, environment, and many other effectors. Additionally, as an agricultural tag-along, much of the genetic variation in A. thaliana is due to admixture between populations as the result of human influenced migration. So keep in mind throughout this analysis the limitations of examining a single snapshot in time of a potentially biased collection of frequently self-fertilizing organisms that were dispersed by human migration. Because of these limitations, inferring theories about how evolutionary forces are acting on your favorite genes is very uncertain business. However, if we assume a few things we can make some speculations.
Assumptions:
1. Polymorphisms in the dataset are not completely detrimental to organismal fitness.
2. Most nonsynonymous polymorphisms in a gene result in a reduction in function.
Now, taking a gene-centric view of evolution, here are some of the evolutionary pressures, or types of natural selection, that the genes we are analyzing may be experiencing and what effects these pressures have on genetic diversity.
If a gene's sequence is highly conserved (low diversity) it may be under purifying/negative/stabilizing selection. If a genes' function is crucial to the fitness of the organism, and most polymorphisms result in loss of function, we would expect a low level of nonsynonymous polymorphism in this gene. This is because deleterious polymorphisms that do arise in this gene within the population will be outcompeted by the more-fit parent. This results in the decreased genetic diversity in genes under purifying selection.
If evolutionary pressure is very weak on a gene its sequence might change with no effect on organismal fitness. This allows a gene to accumulate many mutations that may persist in the population or be lost at random. This results in random fluctuations in allele frequencies, which when combined with an appreciable rate of mutation and a sufficient population size yields increased genetic diversity in the population.
Beyond purifying selection and drift it is very difficult to make any strong inferences on the population as a whole. However there are two more types of selective pressure, balancing or neutral selection and positive or directional selection.
If a polymorphism (or allele) is associated with a phenotype that provides an organism with a fitness advantage this phenotype and associated allele may be experiencing directional or positive selection. Positive/directional selection results in an increase in the prevalence of the advantageous allele within the population over time. Directional selection is very difficult to infer and is a very active area of research. Directional selection, like stabilizing selection is associated with reduced genetic diversity within the population under selection. However, unlike stabilizing selection the gene under directional selection will potentially have two major alleles that provide adaptive potential to different environments or genetic backgrounds.
Genes of high diversity may also be under balancing selection, also called neutral divergence. Balancing selection maintains multiple variants for long evolutionary times because different alleles provide selective advantages in different environments or genetic backgrounds. This results in high diversity at the particular site under selection and increased diversity at closely linked neutral sites due to genetic hitchhiking with this allele under balancing selection.
Now go collect and read the literature about your family and make some hypotheses!
Here are some questions to help you get started:
What structure/function knowledge is there?
What is each domain in the protein doing?
Are there any positions that are of vital importance?
Is there a published protein sequence alignment of your gene family?
What positions in the protein family are highly conserved?
What positions are highly variable?
Is there a published phylogenetic tree of your gene family?
What are the most recently duplicated sister gene pairs?
What do you know about the phenotypes of mutants in these pairs?
Do the genetics suggest these pairs are redundant?
In TIR1 and AFB2 there are 2 domains. The F-box domain interacts with the SCF components [@gray_identification_1999], and the LRR domain binds auxin and Aux/IAA proteins [@kepinski_auxininduced_2004]. The F-box is roughly 40 amino acids and wraps tightly around the SKP1 component of the SCF complex. The LRR is greater than 500 amino acids and binds auxin in a pocket formed by a horseshoe shaped chain of LRRs [@tan_mechanism_2007]. The outside of this horseshoe shape does not have any known interactions other than a dimerization interface [@dezfulian_oligomerization_2016]. Because the F-box domain must interact with the SCF complex for TIR1/AFB2 to perform their known function, we might expect the F-box to be highly conserved. The small size of the F-box also helps strengthen this hypothesis. The LRR domain is much larger and may have differing specificities for Aux/IAAs, so we might expect it to be more diverse. We would also expect the residues of the LRR that contact auxin and the Aux/IAA degron to be highly conserved as they are key to the function of these proteins.
In addition to calling the functions of the r1001genomes package in R directly. There is also a GUI provided in the form of a Shiny app. This app can perform some of the analysis tasks the package is capable of performing, without the need to have knowledge of programming in R.
In this vignette panels like this one are used to show examples of using the app to achieve similar results as using the R function calls directly in this vignette. Screenshots will usually be provided to make it clear what section or tab we are talking about
knitr::include_graphics("Screenshots/App_Initial.PNG") knitr::asis_output(" ") # This newline required for headings to render after image
To start the Shiny app, run the following command:
run1001genomes()
The app should appear in a new window, and look similar to the image above.
In these lighter colored panels, the equivalent R code will be displayed.
The first step in retrieving the natural variation in our genes is to retrieve the information about where the genes are in the genome of Col-0, the laboratory reference accession, and how the mRNA and coding sequence is laid out. This information is called the gene model.
Do you have a spreadsheet of loci?
Here is some quick code to make a text string you can copy and paste into the app gene input box.
AFB_loci <- read.csv(system.file("extdata", "AFB_gene_ids.csv", package = "r1001genomes")) paste(AFB_loci$tair_locus, collapse = ",")
Occasionally there are multiple models for a single gene if there are multiple possible transcripts or splice variants. Does your gene have multiple models?
The 1001 genomes dataset is stored as a VCF (Variant Call Format) file. This is a very efficient format for storing such a large dataset (the 1135 genomes are ~17gb), and for some analyses its matrix format is very useful.
Do these values match your hypotheses? Is there anything unexpected?
For our TIR1 and AFB2 example, we can make several interesting observations based on this dataset.
1. AFB2 had higher number of total polymorphisms than TIR1, across all subsets of sites except for introns.
2. AFB2 also has a higher total coding diversity and higher ratio of diversities at nonsynonymous to synonymous sites.
This was expected, as TIR1 mutants have a strong loss of auxin sensitivity whereas AFB2 mutants have a much weaker effect. Therefore, polymorphisms in TIR1 are much more likely to have an effect on phenotype than polymorphisms in AFB2.
Now we'll move on to the second "Diversity Plot" tab. This tab will calculate the diversity at each position within the coding sequence of each gene. In addition to calculating the nucleotide diversity of a whole sequence, like the transcript of a gene, we can also calculate the diversity statistic for each position along the reference genome. This could be useful if you suspect a particular region of your gene may be more important to function or more sensitive to nucleotide or amino acid changes.
Do these values match your hypotheses? Is there anything unexpected?
In AFB2 there are five alleles with diversity greater than 1E-2 and 20 alleles with diversity greater than 3E-3. TIR1 has two alleles with diversity greater than 1E-2 and nine alleles with diversity greater than 3E-3. This high number of high frequency alleles in the population suggests that AFB2 is more likely to be under drift or balancing selection relative to TIR1 or that TIR1 is under stronger purifying selection than AFB2. It is important to consider however that sampling bias in certain geographic populations could affect these hypotheses. To get some idea of if this is true we will look at the geographic distribution of these alleles.
As an initial visualization of the population structure that may be influencing the allele frequency and prevalence of alleles in a gene, we can map the collection locations from which the accessions were collected. If the accessions sharing a similar polymorphism are all closely clustered then it is likely several members of the same population derived from a common ancestor are represented. If there are many distantly located accessions sharing the same polymorphism this may result from human-associated migration and admixture or possibly from independent mutation events. In addition to these patterns, look for potential environmental gradients of particular alleles (e.g. is a particular allele only found above/below a certain latitude or only in coastal/desert environments?).
description
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.