This worked example shows:
Try modifying the code to import your own data!
This code imports genetic data from 181 individuals of Colombia spotted frogs (Rana luteiventris) from 12 populations. The data are a subsample of the full data set analyzed in Funk et al. (2005) and Murphy et al. (2010). Please see the separate introduction to the data set.
LandGenCourse
. Install some packages needed for this worked example. Note: popgraph
needs to be installed before installing gstudio
.
if(!require("adegenet")) install.packages("adegenet") if(!requireNamespace("popgraph", quietly = TRUE)) { install.packages(c("RgoogleMaps", "geosphere", "proto", "sampling", "seqinr", "spacetime", "spdep"), dependencies=TRUE) remotes::install_github("dyerlab/popgraph") } if(!requireNamespace("gstudio", quietly = TRUE)) remotes::install_github("dyerlab/gstudio")
Note: the function library
will always load the package, even if it is already loaded, whereas require
will only load it if it is not yet loaded. Either will work.
library(adegenet) library(gstudio) library(LandGenCourse) library(tibble) library(here) library(vcfR) library(pinfsc50) library(utils)
Create a new R project for this lab: File > New Project > New Directory > New Project.
This will automatically set your working directory to the new project folder. That's where R will look for files and save anything you export.
Note: R does not distinguish between single quotes and double quotes, they are treated as synonyms, but the opening and closing symbols need to match.
The dataset ralu.loci
is available in the package 'LandGenCourse' and can be loaded with data(ralu.loci)
.
data(ralu.loci)
Below, however, we'll import it from a comma separated file (.csv), as this is the recommended format for importing your own data (e.g. from Excel).
First we need to copy the file ralu.loci.csv
to the downloads
folder inside your project folder. The first line checks that the folder downloads
exists and creates it if needed.
if(!dir.exists(paste0(here(),"/downloads"))) dir.create(paste0(here(),"/downloads")) file.copy(system.file("extdata", "ralu.loci.csv", package = "LandGenCourse"), paste0(here(), "/downloads/ralu.loci.csv"), overwrite=FALSE)
Check that the file ralu.loci.csv
is now in the folder downloads
within your project folder. From your file system, you can open it with Excel. Note that in Excel, the columns with the loci must be explicitly defined as 'text', otherwise Excel is likely to interpret them as times.
Excel instructions:
Opening an existing .csv file with genetic data:
Creating a new Excel file to enter genetic data:
Saving Excel file for import in R:
fig.source <- system.file("extdata", "ExcelTable.png", package = "LandGenCourse") fig.path <- paste0(here::here(), "/downloads/ExcelTable.png") invisible(file.copy(from=fig.source, to=fig.path))
Here is a screenshot of the first few lines of the Excel file with the genetic data:
cat(paste0("![](", fig.path, "){width=80%}"))
As the file is saved in csv format, we use the function 'read.csv' to import it into an R object called Frogs
. Display the first few rows and columns to check whether the data have been imported correctly:
Frogs <- read.csv(paste0(here(), "/downloads/ralu.loci.csv"), header=TRUE) as_tibble(Frogs)
Frogs
contains 181 rows and 11 columns. Only the first few are shown.<fctr>
means that the variable has been coded as a factor.NA
, (which shows up as NA:NA
for missing genetic data). There seem to be quite a few missing values.We'll add unique ID's for the frogs (to show how these would be imported). Here we take, for each frog, a sub-string of Pop
containing the first three letters, add the row number, and separate them by a period.
Frogs <- data.frame(FrogID = paste(substr(Frogs$Pop, 1, 3), row.names(Frogs), sep="."), Frogs) as_tibble(Frogs)
If you want to export (and import) your own files, adapt this code.
Note: when using R Notebooks, problems can arise if R looks for files in a different place (more on this in Week 8 Bonus material). Here, we use a safe work-around, with the function here
, that will help make your work reproducible and transferable to other computers. For this to work properly, you must work within an R project, and you must have writing permission for the project folder.
The function here
from the package here
finds the path to your project folder. Remove the hashtags and run the two lines to see the path to your project folder, and the path to the (to be created) output folder. The function paste0
simply pastes strings of text together, without any separator (hence the '0' in the function name).
here() paste0(here(),"/output")
The following line checks whether an output
folder exists in the project folder, and if not, creates one.
if(!dir.exists(paste0(here(),"/output"))) dir.create(paste0(here(),"/output"))
Now we can write our data into a csv file in the output folder:
write.csv(ralu.loci, paste0(here(),"/output/ralu.loci.csv"), quote=FALSE, row.names=FALSE)
To import the file again from the output folder, you could do this:
ralu.loci.2 <- read.csv(paste0(here(),"/output/ralu.loci.csv"), header=TRUE)
We use the function df2genind
of the adegenet
package to import the loci into a genind
object named Frogs.genind
. Check the help file for a definition of all arguments and for example code.
?df2genind
Some explanations:
sep
.FrogID
.Pop
or SiteName
.NA
.2
."codom"
for codominant markers like microsatellites, or "PA"
for presence-absence, such as SNP or AFLP markers. Note: you can use either "" or ''.Frogs.genind <- df2genind(X=Frogs[,c(4:11)], sep=":", ncode=NULL, ind.names= Frogs$FrogID, loc.names=NULL, pop=Frogs$Pop, NA.char="NA", ploidy=2, type="codom", strata=NULL, hierarchy=NULL)
Printing the genind object's name lists the number of individuals, loci and alleles, and lists all slots (prefaced by @
).
Frogs.genind
There is a summary
function for genind
objects. Notes:
Group
here refers to the populations, i.e., Group sizes
means number of individuals per population. summary(Frogs.genind)
The summary lists each attribute or slot
of the genind
object and summarizes its content. Technically speaking, genind
objects are S4 objects in R, which means that their slots are accessed with the @
sign, rather than the $
sign for attributes of the more commonly used S3 objects. Interestingly, the object does not store the data table that was imported, but converts it to a table of allele counts.
@tab
{-}Table with allele counts, where each column is an allele. Allele name A.1
means "locus A, allele 1". For a codominant marker (e.g. microsatellite) and a diploid species, allele counts per individual can be 0, 1, 2, or NA=missing. Here are the first few lines:
as_tibble(Frogs.genind@tab)
@loc.n.all
{-}Number of alleles per locus, across all populations.
Frogs.genind@loc.n.all
@loc.fac
{-}This is a factor that indicates for each allele (column in @tab
) which locus it belongs to. The levels of the factor correspond to the loci.
Frogs.genind@loc.fac
@all.names
{-}List of allele names, separately for each locus. Note: the alleles are treated as text (character), even if they are coded as numbers. They are sorted in order of occurrence in data set.
Frogs.genind@all.names
The following slots are automatically filled with default values unless specified by user during import into genind
object:
@ploidy
: ploidy for each individual.@type
: codominant (microsatellites) or presence/absence (SNP, AFLP)The following slots may be empty unless specified during or after import into 'genind' object:
@other
: placeholder for non-genetic data, e.g. spatial coordinates of individuals.@pop
: vector indicating the population that each individual belongs to.@strata
: stratification of populations, e.g. within regions or treatments.@hierarchy
: optional formula defining hierarchical levels in strata.While genind
objects are used by many functions for analyzing genetic data, the gstudio
package provides an interface for additional analysis. We use the function read_population
to import the data.
?read_population
Some explanations:
Working directory confusion: If you are working within an R project folder, then that folder is automatically set as your working directory. If you are working in an R session outside of an R project, then you need to actively set the working directory in the R Studio menu under: Session > Set working directory > Choose Directory
.
Frogs.gstudio <- read_population(path=system.file("extdata", "ralu.loci.csv", package = "LandGenCourse"), type="separated", locus.columns=c(3:10), phased=FALSE, sep=",", header=TRUE)
Check the column types: SiteName
and Pop
are now character vectors, and the loci columns A - H are now of class locus
(it is indicated that 'locus' is a S3 object type).
str(Frogs.gstudio)
Frogs.gstudio
is a data frame, and we can manipulate like any other data frame. Here we will add the variable FrogID
as the first column.
Frogs.gstudio <- data.frame(FrogID=Frogs$FrogID, Frogs.gstudio) head(Frogs.gstudio)
We can use the dataframe with the locus objects from gstudio to import the data into a genind
object. In this case, we set sep=":"
, because the locus
object in gstudio
stores the alleles at each locus separated by a colon, and NA.char=""
, because gstudio
stores missing values as empty cells.
Frogs.genind2 <- adegenet::df2genind(X=Frogs.gstudio[,c(4:11)], sep=":", ncode=NULL, ind.names=Frogs.gstudio$FrogID, loc.names=NULL, pop=Frogs.gstudio$Pop, NA.char="", ploidy=2, type="codom", strata=NULL, hierarchy=NULL) Frogs.genind2
First, we'll import a relatively small SNP dataset from a text file. This dataset will be analyzed in the Week 6 Worked Example. The data are stored in a tab-delimited text file.
Let's have a look at the first few rows and columns:
family
: these are ID's of the genotyped tree individuals (from which seeds were grown in a common-garden experiment, hence they are called "family" in the dataset).population
: the trees were sampled from a set of populations.snpXXX.Plmn
: each column is a diploid SNP.To import your own data, replace infile
by the path to the text file with your SNP data. Change settings as necessary (check whether data are separated by tabs or commas!).
infile <- system.file("extdata", "WWP_SNP_genotypes.txt", package = "LandGenCourse") Trees <- read.table(infile, header = TRUE, sep = "\t") Trees[1:6, 1:6]
We should replace the SNP column names: if they contain a period ., then the function df2genind thinks that the first part indicates the locus and the second part the allele. The following line of code splits each column name by the period and only retains the first part.
names(Trees) <- unlist(lapply(names(Trees), function(x) strsplit(x, "[.]")[[1]][1]))
To import these data frame into a genind
object, we need to tell R how the SNP data are coded (check the df2genind
help file!)
ncode = 1
: each allele (A, C, G, T) is coded with a single character.sep = ""
: the two alleles per individual and locus are not separated by any symbol.NA.char= "NA"
: missing values are coded as NA.Trees.genind <- adegenet::df2genind(X=Trees[,-c(1:2)], sep="", ncode=1, ind.names=Trees$family, loc.names=NULL, pop=Trees$population, NA.char="NA", ploidy=2, type="codom", strata=NULL, hierarchy=NULL) Trees.genind
A common data format for genomic data is vcf. Here's some background information (all but the first are vignettes from the vcfR
package):
The R package vcfR
makes it easy to import vcf files and convert to other formats. Here we use it to convert the data into a genind
object. The following code is adapted from the above listed vcfR
vignettes.
We'll import a file that is distributed with the pinfsc50
package. Note that vcf
files are commonly compressed in gz
archives. To import your own data, replace vcf_file
by the path to your own gz
archive.
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz", package = "pinfsc50") vcf <- read.vcfR( vcf_file, verbose = FALSE )
Note that the package vcfR
can import three types of genomic data files listed below. We will only use the .vcf
file here.
vcf
: SNPs.fasta
: genomic reference sequence data.gff
: annotationConvert to genind
object:
SNP_genind <- vcfR2genind(vcf) SNP_genind
This dataset has 18 individuals genotyped at 44,137. It may be better to use a genlight
object instead of genind
. This is an alternative object type from the package adegenet
that can store large genomic data much more efficiently.
Note: the current implementation can only store two alleles per locus, loci with more alleles will be dropped and a warning is printed.
SNP_genlight <- vcfR2genlight(vcf) SNP_genlight
Note that instead of 44,137 alleles, we now have 21,719 binary SNPs. This is one of the ways by which the genlight
object is more efficient than genind
. Overall, the size has been reduced from 16.4 Mb to 2.2 Mb!
R Notebook: Create a new R Notebook for each weekly exercise. Watch the course video "Week 0: Intro to R Notebooks" as needed. At the end, "knit" it to an html file and view it in your browser.
Good news: if you can knit the file, the code can stand by itself (it does not depend on what dyou did in your R session before) and runs without errors. This is a good check. If there are error messages, check the 'R Markdown' tab for the code line number and try to fix it.
Task: Import the data set pulsatilla_genotypes
, which we'll use in a later lab (Week 14), into gstudio
and convert it to a genind
object.
Data: This file contains microsatellite data for adults and seeds of the herb Pulsatilla vulgaris sampled at seven sites. Reference: DiLeo et al. (2018), Journal of Ecology 106:2242-2255. https://besjournals.onlinelibrary.wiley.com/doi/10.1111/1365-2745.12992
The following code copies the file into the 'downloads' folder in your R project folder:
file.copy(system.file("extdata", "pulsatilla_genotypes.csv", package = "LandGenCourse"), paste0(here(), "/downloads/pulsatilla_genotypes.csv"), overwrite=FALSE)
Variables:
Hints:
a) Load packages: Make sure the following packages are loaded: gstudio
, here
,tibble
and adegenet
.
b) View data file: Adapt the code from section 2.c to import the raw data set. The file has a header row. View it. How are the genetic data coded?
c) Import data into gstudio: Adapt the code from section 5.b to import the genetic data with 'gstudio'. The loci are in columns 6 - 19. What setting for type
is appropriate for this data set? Check the help file for read_population
.
d) Check imported data: Use str
or as_tibble
to check the imported data. Does each variable have the correct data type? Note: there should be 7 variables of type locus
.
e) Check variable types: Create a bulleted list, like the one above, with the variables. For each variable, list their R data type (e.g., numeric, integer, character, logical, factor, locus). Check the cheatsheet "R markdown language" as needed.
f) Convert to genind object: Modify the code from section 5.e to convert the data from gstudio
to a genind
object. Ignore the warning about duplicate labels. Print a summary of the genind
object.
Question: What is the range of the number of genotyped individuals per population in this dataset?
# Note: this line of code detaches all packages that were loaded. This is necessary only for technical reasons of building the R package. LandGenCourse::detachAllPackages()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.