1. Overview of Worked Example {-}

a) Goals {-}

This worked example shows:

Try modifying the code to import your own data!

b) Data set {-}

This code imports genetic data from 181 individuals of Colombia spotted frogs (Rana luteiventris) from 12 populations. The data are a subsample of the full data set analyzed in Funk et al. (2005) and Murphy et al. (2010). Please see the separate introduction to the data set.

c) Required R packages {-}

Install some packages needed for this worked example. Note: popgraph needs to be installed before installing gstudio.

if(!require("adegenet")) install.packages("adegenet")
if(!requireNamespace("popgraph", quietly = TRUE))
{
  install.packages(c("RgoogleMaps", "geosphere", "proto", "sampling", 
                      "seqinr", "spacetime", "spdep"), dependencies=TRUE)
  remotes::install_github("dyerlab/popgraph")
}
if(!requireNamespace("gstudio", quietly = TRUE)) remotes::install_github("dyerlab/gstudio")

Note: the function library will always load the package, even if it is already loaded, whereas require will only load it if it is not yet loaded. Either will work.

library(adegenet)
library(gstudio)
library(LandGenCourse)
library(tibble)
library(here)
library(vcfR)
library(pinfsc50)
library(utils)

2. Import data from .csv file {-}

a) Preparation {-}

Create a new R project for this lab: File > New Project > New Directory > New Project.

This will automatically set your working directory to the new project folder. That's where R will look for files and save anything you export.

Note: R does not distinguish between single quotes and double quotes, they are treated as synonyms, but the opening and closing symbols need to match.

The dataset ralu.loci is available in the package 'LandGenCourse' and can be loaded with data(ralu.loci).

data(ralu.loci)

Below, however, we'll import it from a comma separated file (.csv), as this is the recommended format for importing your own data (e.g. from Excel).

b) Excel file {-}

First we need to copy the file ralu.loci.csv to the downloads folder inside your project folder. The first line checks that the folder downloads exists and creates it if needed.

if(!dir.exists(paste0(here(),"/downloads"))) dir.create(paste0(here(),"/downloads"))
file.copy(system.file("extdata", "ralu.loci.csv", package = "LandGenCourse"),
          paste0(here(), "/downloads/ralu.loci.csv"), overwrite=FALSE)

Check that the file ralu.loci.csv is now in the folder downloads within your project folder. From your file system, you can open it with Excel. Note that in Excel, the columns with the loci must be explicitly defined as 'text', otherwise Excel is likely to interpret them as times.

Excel instructions:

Opening an existing .csv file with genetic data:

Creating a new Excel file to enter genetic data:

Saving Excel file for import in R:

fig.source <- system.file("extdata", "ExcelTable.png", package = "LandGenCourse")
fig.path <- paste0(here::here(), "/downloads/ExcelTable.png")
invisible(file.copy(from=fig.source, to=fig.path)) 

Here is a screenshot of the first few lines of the Excel file with the genetic data:

cat(paste0("![](", fig.path, "){width=80%}"))

c) Import with function 'read.csv' {-}

As the file is saved in csv format, we use the function 'read.csv' to import it into an R object called Frogs. Display the first few rows and columns to check whether the data have been imported correctly:

Frogs <- read.csv(paste0(here(), "/downloads/ralu.loci.csv"), header=TRUE)
as_tibble(Frogs)

d) Create ID variable {-}

We'll add unique ID's for the frogs (to show how these would be imported). Here we take, for each frog, a sub-string of Pop containing the first three letters, add the row number, and separate them by a period.

Frogs <- data.frame(FrogID = paste(substr(Frogs$Pop, 1, 3), row.names(Frogs), sep="."), Frogs)
as_tibble(Frogs)

e) Export with function 'write.csv' {-}

If you want to export (and import) your own files, adapt this code.

Note: when using R Notebooks, problems can arise if R looks for files in a different place (more on this in Week 8 Bonus material). Here, we use a safe work-around, with the function here, that will help make your work reproducible and transferable to other computers. For this to work properly, you must work within an R project, and you must have writing permission for the project folder.

The function here from the package here finds the path to your project folder. Remove the hashtags and run the two lines to see the path to your project folder, and the path to the (to be created) output folder. The function paste0 simply pastes strings of text together, without any separator (hence the '0' in the function name).

here() 
paste0(here(),"/output")

The following line checks whether an output folder exists in the project folder, and if not, creates one.

if(!dir.exists(paste0(here(),"/output"))) dir.create(paste0(here(),"/output"))

Now we can write our data into a csv file in the output folder:

write.csv(ralu.loci, paste0(here(),"/output/ralu.loci.csv"), 
          quote=FALSE, row.names=FALSE)

To import the file again from the output folder, you could do this:

ralu.loci.2 <- read.csv(paste0(here(),"/output/ralu.loci.csv"), header=TRUE)

3. Create 'genind' Object {-}

a) Read the helpfile {-}

We use the function df2genind of the adegenet package to import the loci into a genind object named Frogs.genind. Check the help file for a definition of all arguments and for example code.

?df2genind

b) Set all arguments of function 'df2genind' {-}

Some explanations:

Frogs.genind <- df2genind(X=Frogs[,c(4:11)], sep=":", ncode=NULL, ind.names= Frogs$FrogID, loc.names=NULL, pop=Frogs$Pop, NA.char="NA", ploidy=2, type="codom", strata=NULL, hierarchy=NULL)

c) Check imported data {-}

Printing the genind object's name lists the number of individuals, loci and alleles, and lists all slots (prefaced by @).

Frogs.genind

d) Summarize 'genind' object {-}

There is a summary function for genind objects. Notes:

summary(Frogs.genind)

4. View information stored in 'genind' object {-}

The summary lists each attribute or slot of the genind object and summarizes its content. Technically speaking, genind objects are S4 objects in R, which means that their slots are accessed with the @ sign, rather than the $ sign for attributes of the more commonly used S3 objects. Interestingly, the object does not store the data table that was imported, but converts it to a table of allele counts.

a) Slot @tab {-}

Table with allele counts, where each column is an allele. Allele name A.1 means "locus A, allele 1". For a codominant marker (e.g. microsatellite) and a diploid species, allele counts per individual can be 0, 1, 2, or NA=missing. Here are the first few lines:

as_tibble(Frogs.genind@tab)

b) Slot @loc.n.all {-}

Number of alleles per locus, across all populations.

Frogs.genind@loc.n.all

c) Slot @loc.fac {-}

This is a factor that indicates for each allele (column in @tab) which locus it belongs to. The levels of the factor correspond to the loci.

Frogs.genind@loc.fac

d) Slot @all.names {-}

List of allele names, separately for each locus. Note: the alleles are treated as text (character), even if they are coded as numbers. They are sorted in order of occurrence in data set.

Frogs.genind@all.names

e) Further slots {-}

The following slots are automatically filled with default values unless specified by user during import into genind object:

The following slots may be empty unless specified during or after import into 'genind' object:

5. Import data with 'gstudio' package {-}

a) Function 'read_population' {-}

While genind objects are used by many functions for analyzing genetic data, the gstudio package provides an interface for additional analysis. We use the function read_population to import the data.

?read_population

b) Set arguments {-}

Some explanations:

Working directory confusion: If you are working within an R project folder, then that folder is automatically set as your working directory. If you are working in an R session outside of an R project, then you need to actively set the working directory in the R Studio menu under: Session > Set working directory > Choose Directory.

Frogs.gstudio <- read_population(path=system.file("extdata", "ralu.loci.csv", 
                                                  package = "LandGenCourse"), 
                   type="separated", locus.columns=c(3:10), 
                   phased=FALSE, sep=",", header=TRUE)

c) Check imported data {-}

Check the column types: SiteName and Pop are now character vectors, and the loci columns A - H are now of class locus (it is indicated that 'locus' is a S3 object type).

str(Frogs.gstudio)

d) Add FrogID {-}

Frogs.gstudio is a data frame, and we can manipulate like any other data frame. Here we will add the variable FrogID as the first column.

Frogs.gstudio <- data.frame(FrogID=Frogs$FrogID, Frogs.gstudio) 
head(Frogs.gstudio)

e) Convert to 'genind' object {-}

We can use the dataframe with the locus objects from gstudio to import the data into a genind object. In this case, we set sep=":", because the locus object in gstudio stores the alleles at each locus separated by a colon, and NA.char="", because gstudio stores missing values as empty cells.

Frogs.genind2 <- adegenet::df2genind(X=Frogs.gstudio[,c(4:11)], sep=":", ncode=NULL,   
                          ind.names=Frogs.gstudio$FrogID, loc.names=NULL, 
                          pop=Frogs.gstudio$Pop, NA.char="", ploidy=2, 
                          type="codom", strata=NULL, hierarchy=NULL)
Frogs.genind2

6. Importing SNP data {-}

a) Import from text file {-}

First, we'll import a relatively small SNP dataset from a text file. This dataset will be analyzed in the Week 6 Worked Example. The data are stored in a tab-delimited text file.

Let's have a look at the first few rows and columns:

To import your own data, replace infile by the path to the text file with your SNP data. Change settings as necessary (check whether data are separated by tabs or commas!).

infile <- system.file("extdata", "WWP_SNP_genotypes.txt", package = "LandGenCourse")
Trees <- read.table(infile, header = TRUE, sep = "\t")
Trees[1:6, 1:6]

We should replace the SNP column names: if they contain a period ., then the function df2genind thinks that the first part indicates the locus and the second part the allele. The following line of code splits each column name by the period and only retains the first part.

names(Trees) <- unlist(lapply(names(Trees), function(x) strsplit(x, "[.]")[[1]][1]))

To import these data frame into a genind object, we need to tell R how the SNP data are coded (check the df2genind help file!)

Trees.genind <- adegenet::df2genind(X=Trees[,-c(1:2)], sep="", ncode=1,   
                          ind.names=Trees$family, loc.names=NULL, 
                          pop=Trees$population, NA.char="NA", ploidy=2, 
                          type="codom", strata=NULL, hierarchy=NULL)
Trees.genind

b) Import from vcf file to genind object {-}

A common data format for genomic data is vcf. Here's some background information (all but the first are vignettes from the vcfR package):

The R package vcfR makes it easy to import vcf files and convert to other formats. Here we use it to convert the data into a genind object. The following code is adapted from the above listed vcfR vignettes.

We'll import a file that is distributed with the pinfsc50 package. Note that vcf files are commonly compressed in gz archives. To import your own data, replace vcf_file by the path to your own gz archive.

vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz", package = "pinfsc50")
vcf <- read.vcfR( vcf_file, verbose = FALSE )

Note that the package vcfR can import three types of genomic data files listed below. We will only use the .vcf file here.

Convert to genind object:

SNP_genind <- vcfR2genind(vcf)
SNP_genind

c) Import from vcf file to genlight object {-}

This dataset has 18 individuals genotyped at 44,137. It may be better to use a genlight object instead of genind. This is an alternative object type from the package adegenet that can store large genomic data much more efficiently.

Note: the current implementation can only store two alleles per locus, loci with more alleles will be dropped and a warning is printed.

SNP_genlight <- vcfR2genlight(vcf)
SNP_genlight

Note that instead of 44,137 alleles, we now have 21,719 binary SNPs. This is one of the ways by which the genlight object is more efficient than genind. Overall, the size has been reduced from 16.4 Mb to 2.2 Mb!

7. R exercise Week 1 {-}

R Notebook: Create a new R Notebook for each weekly exercise. Watch the course video "Week 0: Intro to R Notebooks" as needed. At the end, "knit" it to an html file and view it in your browser.

Good news: if you can knit the file, the code can stand by itself (it does not depend on what dyou did in your R session before) and runs without errors. This is a good check. If there are error messages, check the 'R Markdown' tab for the code line number and try to fix it.

Task: Import the data set pulsatilla_genotypes, which we'll use in a later lab (Week 14), into gstudio and convert it to a genind object.

Data: This file contains microsatellite data for adults and seeds of the herb Pulsatilla vulgaris sampled at seven sites. Reference: DiLeo et al. (2018), Journal of Ecology 106:2242-2255. https://besjournals.onlinelibrary.wiley.com/doi/10.1111/1365-2745.12992

The following code copies the file into the 'downloads' folder in your R project folder:

file.copy(system.file("extdata", "pulsatilla_genotypes.csv", package = "LandGenCourse"),
          paste0(here(), "/downloads/pulsatilla_genotypes.csv"), overwrite=FALSE)

Variables:

Hints:

a) Load packages: Make sure the following packages are loaded: gstudio, here,tibble and adegenet. b) View data file: Adapt the code from section 2.c to import the raw data set. The file has a header row. View it. How are the genetic data coded? c) Import data into gstudio: Adapt the code from section 5.b to import the genetic data with 'gstudio'. The loci are in columns 6 - 19. What setting for type is appropriate for this data set? Check the help file for read_population. d) Check imported data: Use str or as_tibble to check the imported data. Does each variable have the correct data type? Note: there should be 7 variables of type locus. e) Check variable types: Create a bulleted list, like the one above, with the variables. For each variable, list their R data type (e.g., numeric, integer, character, logical, factor, locus). Check the cheatsheet "R markdown language" as needed. f) Convert to genind object: Modify the code from section 5.e to convert the data from gstudio to a genind object. Ignore the warning about duplicate labels. Print a summary of the genind object.

Question: What is the range of the number of genotyped individuals per population in this dataset?

# Note: this line of code detaches all packages that were loaded. This is necessary only for technical reasons of building the R package.
LandGenCourse::detachAllPackages()


hhwagner1/LandGenCourse documentation built on Feb. 17, 2024, 4:42 p.m.