In hhwagner1/LandGenCourse: Interface for course "Landscape Genetic Data Analysis with R"

1. Overview of Worked Example {-}

a) Goals {-}

This worked example shows:

How microsatellite data may be coded in a spreadsheet.
How such a file is imported into R.
How the data are imported into a 'genind' object using package 'adegenet'.
How to view information stored in a 'genind' object.
How to import the data with the 'gstudio' package.
How to import SNP data

Try modifying the code to import your own data!

b) Data set {-}

This code imports genetic data from 181 individuals of Colombia spotted frogs (Rana luteiventris) from 12 populations. The data are a subsample of the full data set analyzed in Funk et al. (2005) and Murphy et al. (2010). Please see the separate introduction to the data set.

ralu.loci: Data frame with populations and genetic data (181 rows x 9 columns). Included in package LandGenCourse.

c) Required R packages {-}

Install some packages needed for this worked example. Note: popgraph needs to be installed before installing gstudio.

if(!require("adegenet")) install.packages("adegenet")
if(!requireNamespace("popgraph", quietly = TRUE))
{
  install.packages(c("RgoogleMaps", "geosphere", "proto", "sampling", 
                      "seqinr", "spacetime", "spdep"), dependencies=TRUE)
  remotes::install_github("dyerlab/popgraph")
}
if(!requireNamespace("gstudio", quietly = TRUE)) remotes::install_github("dyerlab/gstudio")

Note: the function library will always load the package, even if it is already loaded, whereas require will only load it if it is not yet loaded. Either will work.

library(adegenet)
library(gstudio)
library(LandGenCourse)
library(tibble)
library(here)
library(vcfR)
library(pinfsc50)
library(utils)

2. Import data from .csv file {-}

a) Preparation {-}

Create a new R project for this lab: File > New Project > New Directory > New Project.

This will automatically set your working directory to the new project folder. That's where R will look for files and save anything you export.

Note: R does not distinguish between single quotes and double quotes, they are treated as synonyms, but the opening and closing symbols need to match.

The dataset ralu.loci is available in the package 'LandGenCourse' and can be loaded with data(ralu.loci).

data(ralu.loci)

Below, however, we'll import it from a comma separated file (.csv), as this is the recommended format for importing your own data (e.g. from Excel).

b) Excel file {-}

First we need to copy the file ralu.loci.csv to the downloads folder inside your project folder. The first line checks that the folder downloads exists and creates it if needed.

if(!dir.exists(paste0(here(),"/downloads"))) dir.create(paste0(here(),"/downloads"))
file.copy(system.file("extdata", "ralu.loci.csv", package = "LandGenCourse"),
          paste0(here(), "/downloads/ralu.loci.csv"), overwrite=FALSE)

Check that the file ralu.loci.csv is now in the folder downloads within your project folder. From your file system, you can open it with Excel. Note that in Excel, the columns with the loci must be explicitly defined as 'text', otherwise Excel is likely to interpret them as times.

Excel instructions:

Opening an existing .csv file with genetic data:

Excel menu: 'File > New Workbook'.
Excel menu: 'File > Import > CSV File > Import'. Navigate to file.
Text Import Wizard: select 'Delimited', click 'Next'.
Delimiters: check 'comma'. Click 'Next'.
Use the 'Shift' key to select all columns with loci A - H. Select 'text'. Click 'Finish > OK'.

Creating a new Excel file to enter genetic data:

Label the loci columns.
Select all loci columns by clicking on the column headers (hold the 'Shift' key to select multiple adjacent columns).
Define the data format: 'Format > Cell > Text'.
Enter genetic data, separating the alleles by a colon (e.g., '1:3').

Saving Excel file for import in R:

File Format: 'Comma Separated Values (.csv)'

fig.source <- system.file("extdata", "ExcelTable.png", package = "LandGenCourse")
fig.path <- paste0(here::here(), "/downloads/ExcelTable.png")
invisible(file.copy(from=fig.source, to=fig.path))

Here is a screenshot of the first few lines of the Excel file with the genetic data:

cat(paste0("![](", fig.path, "){width=80%}"))

c) Import with function 'read.csv' {-}

As the file is saved in csv format, we use the function 'read.csv' to import it into an R object called Frogs. Display the first few rows and columns to check whether the data have been imported correctly:

Frogs <- read.csv(paste0(here(), "/downloads/ralu.loci.csv"), header=TRUE)
as_tibble(Frogs)

The data frame Frogs contains 181 rows and 11 columns. Only the first few are shown.
The file has a header row with column names.
Notice the abbreviations for the R data type under each variable name: <fctr> means that the variable has been coded as a factor.
Each row is one individual. Note that for this data set, there is no column with ID or names of individuals.
The first two columns indicates the site in long ("SiteName") and short ("Pop") format.
There are 8 columns with loci, named A - H. Each locus has two alleles, coded with numbers and separated by a colon (:).
This coding suggests that the markers are codominant and that the species is diploid.
The code for missing values is NA, (which shows up as NA:NA for missing genetic data). There seem to be quite a few missing values.

d) Create ID variable {-}

We'll add unique ID's for the frogs (to show how these would be imported). Here we take, for each frog, a sub-string of Pop containing the first three letters, add the row number, and separate them by a period.

Frogs <- data.frame(FrogID = paste(substr(Frogs$Pop, 1, 3), row.names(Frogs), sep="."), Frogs)
as_tibble(Frogs)

e) Export with function 'write.csv' {-}

If you want to export (and import) your own files, adapt this code.

Note: when using R Notebooks, problems can arise if R looks for files in a different place (more on this in Week 8 Bonus material). Here, we use a safe work-around, with the function here, that will help make your work reproducible and transferable to other computers. For this to work properly, you must work within an R project, and you must have writing permission for the project folder.

The function here from the package here finds the path to your project folder. Remove the hashtags and run the two lines to see the path to your project folder, and the path to the (to be created) output folder. The function paste0 simply pastes strings of text together, without any separator (hence the '0' in the function name).

here() 
paste0(here(),"/output")

The following line checks whether an output folder exists in the project folder, and if not, creates one.

if(!dir.exists(paste0(here(),"/output"))) dir.create(paste0(here(),"/output"))

Now we can write our data into a csv file in the output folder:

write.csv(ralu.loci, paste0(here(),"/output/ralu.loci.csv"), 
          quote=FALSE, row.names=FALSE)

To import the file again from the output folder, you could do this:

ralu.loci.2 <- read.csv(paste0(here(),"/output/ralu.loci.csv"), header=TRUE)

3. Create 'genind' Object {-}

a) Read the helpfile {-}

We use the function df2genind of the adegenet package to import the loci into a genind object named Frogs.genind. Check the help file for a definition of all arguments and for example code.

?df2genind

b) Set all arguments of function 'df2genind' {-}

Some explanations:

X: this is the data frame (or matrix) containing the loci only. Hence we need columns 4 - 11.
sep: need to specify the separator between alleles (here: colon)
ncode: not needed here because we used a separator, defined under sep.
ind.names: here we should indicate the column FrogID.
loc.names: by default this will use the column names of the loci.
pop: here we should indicate either the column Pop or SiteName.
NA.char: how were missing values coded? Here: NA.
ploidy: this species is diploid, hence 2.
type: marker type, either "codom" for codominant markers like microsatellites, or "PA" for presence-absence, such as SNP or AFLP markers. Note: you can use either "" or ''.
strata: one way of defining hierarchical sampling levels. We will add this in Week 3.
hierarchy: another way of defining hierarchical sampling levels. We'll skip it for now.

Frogs.genind <- df2genind(X=Frogs[,c(4:11)], sep=":", ncode=NULL, ind.names= Frogs$FrogID, loc.names=NULL, pop=Frogs$Pop, NA.char="NA", ploidy=2, type="codom", strata=NULL, hierarchy=NULL)

c) Check imported data {-}

Printing the genind object's name lists the number of individuals, loci and alleles, and lists all slots (prefaced by @).

Frogs.genind

d) Summarize 'genind' object {-}

There is a summary function for genind objects. Notes:

Group here refers to the populations, i.e., Group sizes means number of individuals per population.
It is a good idea to check the percentage of missing values here. If it is 0%, this may indicate that the coding for missing values was not recognized. If the % missing is very high, on the other hand, there may be some other importing problem.

summary(Frogs.genind)

4. View information stored in 'genind' object {-}

The summary lists each attribute or slot of the genind object and summarizes its content. Technically speaking, genind objects are S4 objects in R, which means that their slots are accessed with the @ sign, rather than the $ sign for attributes of the more commonly used S3 objects. Interestingly, the object does not store the data table that was imported, but converts it to a table of allele counts.

a) Slot `@tab` {-}

Table with allele counts, where each column is an allele. Allele name A.1 means "locus A, allele 1". For a codominant marker (e.g. microsatellite) and a diploid species, allele counts per individual can be 0, 1, 2, or NA=missing. Here are the first few lines:

as_tibble(Frogs.genind@tab)

b) Slot `@loc.n.all` {-}

Number of alleles per locus, across all populations.

Frogs.genind@loc.n.all

c) Slot `@loc.fac` {-}

This is a factor that indicates for each allele (column in @tab) which locus it belongs to. The levels of the factor correspond to the loci.

Frogs.genind@loc.fac

d) Slot `@all.names` {-}

List of allele names, separately for each locus. Note: the alleles are treated as text (character), even if they are coded as numbers. They are sorted in order of occurrence in data set.

Frogs.genind@all.names

e) Further slots {-}

The following slots are automatically filled with default values unless specified by user during import into genind object:

Slot @ploidy: ploidy for each individual.
Slot @type: codominant (microsatellites) or presence/absence (SNP, AFLP)

The following slots may be empty unless specified during or after import into 'genind' object:

Slot @other: placeholder for non-genetic data, e.g. spatial coordinates of individuals.
Slot @pop: vector indicating the population that each individual belongs to.
Slot @strata: stratification of populations, e.g. within regions or treatments.
Slot @hierarchy: optional formula defining hierarchical levels in strata.

5. Import data with 'gstudio' package {-}

a) Function 'read_population' {-}

While genind objects are used by many functions for analyzing genetic data, the gstudio package provides an interface for additional analysis. We use the function read_population to import the data.

?read_population

b) Set arguments {-}

Some explanations:

path: This is the path to the .csv file. If it is in the current working directory, then the filename is sufficient.
type: here we have 'separated' data, as the alleles at each locus are separated by a colon.
locus.columns: this would be columns 3 - 10 in this case, because we are importing from the .csv file directly, which does not contain the FrogID variable. We can add that later.
phased: default is fine here.
sep: here this refers to the .csv file! The separater is a comma.
header: yes we have column names.

Working directory confusion: If you are working within an R project folder, then that folder is automatically set as your working directory. If you are working in an R session outside of an R project, then you need to actively set the working directory in the R Studio menu under: Session > Set working directory > Choose Directory.

Frogs.gstudio <- read_population(path=system.file("extdata", "ralu.loci.csv", 
                                                  package = "LandGenCourse"), 
                   type="separated", locus.columns=c(3:10), 
                   phased=FALSE, sep=",", header=TRUE)

c) Check imported data {-}

Check the column types: SiteName and Pop are now character vectors, and the loci columns A - H are now of class locus (it is indicated that 'locus' is a S3 object type).

str(Frogs.gstudio)

d) Add FrogID {-}

Frogs.gstudio is a data frame, and we can manipulate like any other data frame. Here we will add the variable FrogID as the first column.

Frogs.gstudio <- data.frame(FrogID=Frogs$FrogID, Frogs.gstudio) 
head(Frogs.gstudio)

e) Convert to 'genind' object {-}

We can use the dataframe with the locus objects from gstudio to import the data into a genind object. In this case, we set sep=":", because the locus object in gstudio stores the alleles at each locus separated by a colon, and NA.char="", because gstudio stores missing values as empty cells.

Frogs.genind2 <- adegenet::df2genind(X=Frogs.gstudio[,c(4:11)], sep=":", ncode=NULL,   
                          ind.names=Frogs.gstudio$FrogID, loc.names=NULL, 
                          pop=Frogs.gstudio$Pop, NA.char="", ploidy=2, 
                          type="codom", strata=NULL, hierarchy=NULL)
Frogs.genind2

6. Importing SNP data {-}

a) Import from text file {-}

First, we'll import a relatively small SNP dataset from a text file. This dataset will be analyzed in the Week 6 Worked Example. The data are stored in a tab-delimited text file.

Let's have a look at the first few rows and columns:

family: these are ID's of the genotyped tree individuals (from which seeds were grown in a common-garden experiment, hence they are called "family" in the dataset).
population: the trees were sampled from a set of populations.
snpXXX.Plmn: each column is a diploid SNP.

To import your own data, replace infile by the path to the text file with your SNP data. Change settings as necessary (check whether data are separated by tabs or commas!).

infile <- system.file("extdata", "WWP_SNP_genotypes.txt", package = "LandGenCourse")
Trees <- read.table(infile, header = TRUE, sep = "\t")
Trees[1:6, 1:6]

We should replace the SNP column names: if they contain a period ., then the function df2genind thinks that the first part indicates the locus and the second part the allele. The following line of code splits each column name by the period and only retains the first part.

names(Trees) <- unlist(lapply(names(Trees), function(x) strsplit(x, "[.]")[[1]][1]))

To import these data frame into a genind object, we need to tell R how the SNP data are coded (check the df2genind help file!)

ncode = 1: each allele (A, C, G, T) is coded with a single character.
sep = "": the two alleles per individual and locus are not separated by any symbol.
NA.char= "NA": missing values are coded as NA.

Trees.genind <- adegenet::df2genind(X=Trees[,-c(1:2)], sep="", ncode=1,   
                          ind.names=Trees$family, loc.names=NULL, 
                          pop=Trees$population, NA.char="NA", ploidy=2, 
                          type="codom", strata=NULL, hierarchy=NULL)
Trees.genind

b) Import from vcf file to genind object {-}

A common data format for genomic data is vcf. Here's some background information (all but the first are vignettes from the vcfR package):

PDF defining the data format: https://samtools.github.io/hts-specs/VCFv4.3.pdf
Tutorial on data format: https://cran.r-project.org/web/packages/vcfR/vignettes/vcf_data.html
Tutorial on quality testing: https://cran.r-project.org/web/packages/vcfR/vignettes/intro_to_vcfR.html
Tutorial on converting vcf data: https://cran.r-project.org/web/packages/vcfR/vignettes/converting_data.html

The R package vcfR makes it easy to import vcf files and convert to other formats. Here we use it to convert the data into a genind object. The following code is adapted from the above listed vcfR vignettes.

We'll import a file that is distributed with the pinfsc50 package. Note that vcf files are commonly compressed in gz archives. To import your own data, replace vcf_file by the path to your own gz archive.

vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz", package = "pinfsc50")
vcf <- read.vcfR( vcf_file, verbose = FALSE )

Note that the package vcfR can import three types of genomic data files listed below. We will only use the .vcf file here.

.vcf: SNPs
.fasta: genomic reference sequence data
.gff: annotation

Convert to genind object:

SNP_genind <- vcfR2genind(vcf)
SNP_genind

c) Import from vcf file to genlight object {-}

This dataset has 18 individuals genotyped at 44,137. It may be better to use a genlight object instead of genind. This is an alternative object type from the package adegenet that can store large genomic data much more efficiently.

Note: the current implementation can only store two alleles per locus, loci with more alleles will be dropped and a warning is printed.

SNP_genlight <- vcfR2genlight(vcf)
SNP_genlight

Note that instead of 44,137 alleles, we now have 21,719 binary SNPs. This is one of the ways by which the genlight object is more efficient than genind. Overall, the size has been reduced from 16.4 Mb to 2.2 Mb!

7. R exercise Week 1 {-}

R Notebook: Create a new R Notebook for each weekly exercise. Watch the course video "Week 0: Intro to R Notebooks" as needed. At the end, "knit" it to an html file and view it in your browser.

Good news: if you can knit the file, the code can stand by itself (it does not depend on what dyou did in your R session before) and runs without errors. This is a good check. If there are error messages, check the 'R Markdown' tab for the code line number and try to fix it.

Task: Import the data set pulsatilla_genotypes, which we'll use in a later lab (Week 14), into gstudio and convert it to a genind object.

Data: This file contains microsatellite data for adults and seeds of the herb Pulsatilla vulgaris sampled at seven sites. Reference: DiLeo et al. (2018), Journal of Ecology 106:2242-2255. https://besjournals.onlinelibrary.wiley.com/doi/10.1111/1365-2745.12992

The following code copies the file into the 'downloads' folder in your R project folder:

file.copy(system.file("extdata", "pulsatilla_genotypes.csv", package = "LandGenCourse"),
          paste0(here(), "/downloads/pulsatilla_genotypes.csv"), overwrite=FALSE)

Variables:

ID: Family ID (i.e., mother and her offspring have the same ID)
OffID: '0' for adults, seeds from the same mom are numbered 1, 2, etc.
Population: site ID
Coordinates: X and Y coordinates (Projection info: EPSG Projection 31468)
Loci: seven diploid microsatellites, each with two columns (1 allele per column)

Hints:

a) Load packages: Make sure the following packages are loaded: gstudio, here,tibble and adegenet. b) View data file: Adapt the code from section 2.c to import the raw data set. The file has a header row. View it. How are the genetic data coded? c) Import data into gstudio: Adapt the code from section 5.b to import the genetic data with 'gstudio'. The loci are in columns 6 - 19. What setting for type is appropriate for this data set? Check the help file for read_population. d) Check imported data: Use str or as_tibble to check the imported data. Does each variable have the correct data type? Note: there should be 7 variables of type locus. e) Check variable types: Create a bulleted list, like the one above, with the variables. For each variable, list their R data type (e.g., numeric, integer, character, logical, factor, locus). Check the cheatsheet "R markdown language" as needed. f) Convert to genind object: Modify the code from section 5.e to convert the data from gstudio to a genind object. Ignore the warning about duplicate labels. Print a summary of the genind object.

Question: What is the range of the number of genotyped individuals per population in this dataset?

# Note: this line of code detaches all packages that were loaded. This is necessary only for technical reasons of building the R package.
LandGenCourse::detachAllPackages()

hhwagner1/LandGenCourse documentation built on Feb. 17, 2024, 4:42 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

hhwagner1/LandGenCourse
Interface for course "Landscape Genetic Data Analysis with R"

In hhwagner1/LandGenCourse: Interface for course "Landscape Genetic Data Analysis with R"

1. Overview of Worked Example {-}

a) Goals {-}

b) Data set {-}

c) Required R packages {-}

2. Import data from .csv file {-}

a) Preparation {-}

b) Excel file {-}

c) Import with function 'read.csv' {-}

d) Create ID variable {-}

e) Export with function 'write.csv' {-}

3. Create 'genind' Object {-}

a) Read the helpfile {-}

b) Set all arguments of function 'df2genind' {-}

c) Check imported data {-}

d) Summarize 'genind' object {-}

4. View information stored in 'genind' object {-}

a) Slot `@tab` {-}

b) Slot `@loc.n.all` {-}

c) Slot `@loc.fac` {-}

d) Slot `@all.names` {-}

e) Further slots {-}

5. Import data with 'gstudio' package {-}

a) Function 'read_population' {-}

b) Set arguments {-}

c) Check imported data {-}

d) Add FrogID {-}

e) Convert to 'genind' object {-}

6. Importing SNP data {-}

a) Import from text file {-}

b) Import from vcf file to genind object {-}

c) Import from vcf file to genlight object {-}

7. R exercise Week 1 {-}

R Package Documentation

Browse R Packages

We want your feedback!

hhwagner1/LandGenCourse Interface for course "Landscape Genetic Data Analysis with R"

In hhwagner1/LandGenCourse: Interface for course "Landscape Genetic Data Analysis with R"

1. Overview of Worked Example {-}

a) Goals {-}

b) Data set {-}

c) Required R packages {-}

2. Import data from .csv file {-}

a) Preparation {-}

b) Excel file {-}

c) Import with function 'read.csv' {-}

d) Create ID variable {-}

e) Export with function 'write.csv' {-}

3. Create 'genind' Object {-}

a) Read the helpfile {-}

b) Set all arguments of function 'df2genind' {-}

c) Check imported data {-}

d) Summarize 'genind' object {-}

4. View information stored in 'genind' object {-}

a) Slot @tab {-}

b) Slot @loc.n.all {-}

c) Slot @loc.fac {-}

d) Slot @all.names {-}

e) Further slots {-}

5. Import data with 'gstudio' package {-}

a) Function 'read_population' {-}

b) Set arguments {-}

c) Check imported data {-}

d) Add FrogID {-}

e) Convert to 'genind' object {-}

6. Importing SNP data {-}

a) Import from text file {-}

b) Import from vcf file to genind object {-}

c) Import from vcf file to genlight object {-}

7. R exercise Week 1 {-}

R Package Documentation

Browse R Packages

We want your feedback!

hhwagner1/LandGenCourse
Interface for course "Landscape Genetic Data Analysis with R"

a) Slot `@tab` {-}

b) Slot `@loc.n.all` {-}

c) Slot `@loc.fac` {-}

d) Slot `@all.names` {-}