We use geogendivr and geogendivrdata to perform this analysis.

Install and load the R package geogendivr

Installation

#if (!requireNamespace("devtools", quietly = TRUE)) { install.packages("devtools") }
#library(devtools)
#devtools::install_github("Grelot/geogendivr")

Loading package

library(geogendivr)

BOLD dataset

BOLD (Barcode of Life Data System) is a database of DNA barcode sequences from georeferenced specimens. Its sequences are grouped into clusters (BINs) that closely approximate species.

geogendivr provides a sample of a BOLD request for the taxon "Pomacanthidae" as a dataset. We use this dataset as an example to test functions of the package geogendivr.

First we need to load the resBold dataset.

##taxonRequest <- "Actinopterygii"
##resBold <- bold_seqspec(taxon=taxonRequest, sepfasta=TRUE)
## load taxon request "Pomacanthidae" sample from BOLD
data(requestPomacanthidaeBOLD)

resBold is the list returned by the bold_seqspec function from the bold package. It contains 2 objects:

  1. a dataframe of specimen information (spatial coordinates, taxonomy...)
  2. a list of DNA barcode sequences

Each row of the dataframe corresponds to one individual sequence. There are 725 published records, all 725 with sequences, forming 71 BINs (clusters), with specimens from 46 countries, deposited in 27 institutions.

## rows is the number of individuals, columns the number of information descriptors
dim(resBold$data)
## number of barcode sequences
length(resBold$fasta)
## names of the information fields
names(resBold$data)

For the next steps, the most important fields are species_name, lat, lon and marker_codes.

Reef Life Survey dataset

Reef Life Survey (RLS) is a set of size and abundance data for thousands of reef-dwelling species, recorded on RLS transects at thousands of sites worldwide.

geogendivr provides two Reef Life Survey dataframes:

## load species Reef Life Survey dataframe
data(reefishSurveySpecies)
## a thorough description of this dataframe is available
#help(reefishSurveySpecies)
## load social environmental Reef Life Survey dataframe
data(reefishSurveyEnvSocio)
## a thorough description of this dataframe is available
#help(reefishSurveyEnvSocio)
plot_reefish_survey(reefishSurveyEnvSocio, 500)

Here we visualize the geographical distribution of Reef Life Survey points, clustered within a distance of 500 km.

Prepare dataset

1. Mutate and filter raw BOLD dataset

Filter and mutate the BOLD dataset to produce a curated dataframe with one row per individual specimen and one column per specimen information field. This step also adds a new column, sequence, holding each FASTA sequence as a string.

The function prepare_bold_res applies 5 filters:

  1. Select individuals with given marker_code
  2. Remove individuals with no species_name information
  3. Remove individuals with no lat or lon coordinates information
  4. Remove DNA sequences with IUPAC ambiguities
  5. Select DNA sequences within a given range of lengths in bp

## filter and mutate
prparedResBold <- prepare_bold_res(resBold,
                                   marker_code="COI-5P",
                                   species_names=TRUE, 
                                   coordinates=TRUE, 
                                   ambiguities=TRUE, 
                                   min_length=420,
                                   max_length=720
                                  )
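
The five filters above can be sketched in base R on a toy dataframe. This is only an illustration under stated assumptions, not the implementation of prepare_bold_res: the helper filter_toy, the toy column names and the toy values are hypothetical, and an "ambiguity" is taken here to mean any character other than A, C, G or T (so gap characters would also be rejected).

```r
# Toy stand-in for a BOLD result; column names and values are illustrative only.
toy <- data.frame(
  marker_code  = c("COI-5P", "COI-5P", "COI-3P", "COI-5P", "COI-5P", "COI-5P"),
  species_name = c("Centropyge bispinosa", "", "Centropyge bispinosa",
                   "Pomacanthus imperator", "Pomacanthus imperator",
                   "Pomacanthus imperator"),
  lat          = c(-17.5, -17.5, -17.5, NA, 12.1, 12.1),
  lon          = c(-149.8, -149.8, -149.8, 55.2, 43.3, 43.3),
  sequence     = c("ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAC",
                   "ACGTACGTAC", "ACGTNCGTAC", "ACGT"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: applies the five filters in one logical mask.
filter_toy <- function(df, marker_code, min_length, max_length) {
  n_bp <- nchar(df$sequence)
  keep <- df$marker_code == marker_code &                 # 1. requested marker
    !is.na(df$species_name) & df$species_name != "" &     # 2. species name present
    !is.na(df$lat) & !is.na(df$lon) &                     # 3. coordinates present
    !grepl("[^ACGT]", toupper(df$sequence)) &             # 4. only A, C, G, T
    n_bp >= min_length & n_bp <= max_length               # 5. length within range
  df[keep, , drop = FALSE]
}

curated <- filter_toy(toy, marker_code = "COI-5P", min_length = 8, max_length = 12)
print(curated)
```

Only the first toy row passes every filter. Each of the other rows fails exactly one of them, which makes the effect of each filter easy to check by hand.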

2. Validate species names with FishBase

Because we work on fishes, and later with the Reef Life Survey dataset, we search FishBase for synonyms to validate the species names from the BOLD dataset.

The function fishbase_name_species_bold checks the species_name field and looks up FishBase synonyms. It then adds a new field, fishbase_species_name.

## validate species names
prparedResBold.fishbaseValid <- fishbase_name_species_bold(prparedResBold)
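
The idea of synonym validation can be illustrated with a plain lookup table. This sketch does not query FishBase and is not how fishbase_name_species_bold is implemented: the helper validate_names and the synonym pairs are hypothetical placeholders, and it assumes that a name with no synonym entry is kept unchanged.

```r
# Hypothetical synonym table: old name -> accepted name (placeholder values).
synonyms <- c("Examplus oldname" = "Examplus newname")

# Hypothetical helper: adds fishbase_species_name from the lookup table.
validate_names <- function(df, synonyms) {
  accepted <- unname(synonyms[df$species_name])   # NA when no synonym entry exists
  # Assumption: names without a synonym entry are kept as they are.
  df$fishbase_species_name <- ifelse(is.na(accepted), df$species_name, accepted)
  df
}

toyNames <- data.frame(species_name = c("Examplus oldname", "Examplus other"),
                       stringsAsFactors = FALSE)
validated <- validate_names(toyNames, synonyms)
print(validated)
```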

3. Cross BOLD dataset with Reef Life Survey dataset

As we work with the Reef Life Survey dataset, we keep only species that are recorded in Reef Life Survey. The function select_reefish_species:

  1. selects sequences and FishBase-validated species names from BOLD that are in the Reef Life Survey database
  2. keeps species with a minimum number of individual sequences

reefishBold <- select_reefish_species(prparedResBold.fishbaseValid,
                                      reefishSurveySpecies,
                                      countSequencesbySpeciesThreshold=2
                                     ) 
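
The two selection steps can be sketched in base R as follows. The helper select_toy and the toy data are hypothetical, and the real function may differ in how it matches names. The sketch assumes the survey species are available as a character vector and that the threshold is inclusive.

```r
# Hypothetical helper mirroring the two steps described above.
select_toy <- function(bold, survey_species, threshold) {
  # Step 1: keep rows whose validated species name occurs in the survey list.
  in_survey <- bold[bold$fishbase_species_name %in% survey_species, , drop = FALSE]
  # Step 2: keep species with at least `threshold` sequences (assumed inclusive).
  counts <- table(in_survey$fishbase_species_name)
  keep <- names(counts)[as.vector(counts) >= threshold]
  in_survey[in_survey$fishbase_species_name %in% keep, , drop = FALSE]
}

toyBold <- data.frame(
  fishbase_species_name = c("A a", "A a", "B b", "C c", "C c"),
  stringsAsFactors = FALSE
)
selected <- select_toy(toyBold, survey_species = c("A a", "B b"), threshold = 2)
print(selected)
```

In this toy case "C c" is dropped because it is absent from the survey list, and "B b" is dropped because it has a single sequence.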

4. Transform the curated BOLD dataframe into an sp spatial points object

To work with spatial environmental data from Reef Life Survey, we transform our BOLD dataframe into a spatial points object with the appropriate projection.

reefishBold.sp <- spatialpoints_bold(reefishBold, projectionCRS="+init=epsg:3347")

5. Get the matrix of presence/absence of a sequence in a survey geographical point from RLS

We compute a 250 km buffer around each RLS survey point, then generate a presence/absence matrix indicating which georeferenced BOLD sequences fall within each survey buffer.

boldWithinRLS <- sequences_within_buffer(latitude=reefishSurveyEnvSocio$SiteLatitude,
                        longitude=reefishSurveyEnvSocio$SiteLongitude,
                        boldSp=reefishBold.sp,
                        bufferDistance=250,
                        projection="+init=epsg:3347"
                        )
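
A presence/absence matrix of this kind can be sketched without any spatial package by using great-circle (haversine) distances in place of projected buffers. This is a swapped-in technique for illustration, not what sequences_within_buffer does internally. The helpers haversine_km and within_buffer are hypothetical, and the orientation (rows as sequences, columns as survey sites) is an assumption.

```r
# Great-circle distance in km between points given in decimal degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Hypothetical helper: logical matrix, rows = sequences, columns = survey sites.
within_buffer <- function(site_lat, site_lon, seq_lat, seq_lon, buffer_km) {
  d <- outer(seq_along(seq_lat), seq_along(site_lat),
             function(i, j) haversine_km(seq_lat[i], seq_lon[i],
                                         site_lat[j], site_lon[j]))
  d <= buffer_km
}

# Two toy survey sites and three toy sequences.
presence <- within_buffer(site_lat = c(0, 10), site_lon = c(0, 10),
                          seq_lat = c(0, 10, 40), seq_lon = c(1, 10.5, 40),
                          buffer_km = 250)
print(presence)
```

The first sequence falls only in the buffer of the first site, the second only in that of the second site, and the third in neither.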

Nucleotide diversity

We gather together sequences from the same species located within the same RLS Survey geographical buffer. Then sequences are aligned and nucleotide diversity is calculated for each species within each RLS Survey geographical buffer.

nucdivSpecies <- species_nucleotide_diversity(infobold=reefishBold, 
                             sequenceWithinBuffer=boldWithinRLS,
                             MinimumNumberOfSequencesBySpecies=3
                             ) 
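
For already aligned sequences of equal length, nucleotide diversity can be computed as the mean proportion of differing sites over all pairs of sequences. The sketch below shows that calculation in base R. The helper nucleotide_diversity is hypothetical, it skips the alignment step, and species_nucleotide_diversity may use a different estimator or handle gaps differently.

```r
# Hypothetical helper: mean pairwise proportion of differing sites.
# Assumes the sequences are already aligned and of equal length.
nucleotide_diversity <- function(seqs) {
  stopifnot(length(seqs) >= 2, length(unique(nchar(seqs))) == 1)
  bases <- do.call(rbind, strsplit(toupper(seqs), ""))  # one row per sequence
  pairs <- combn(nrow(bases), 2)                        # all unordered pairs
  diffs <- apply(pairs, 2, function(p) mean(bases[p[1], ] != bases[p[2], ]))
  mean(diffs)
}

# Three toy sequences of 4 bp: pairwise differences are 1/4, 0 and 1/4.
pi_hat <- nucleotide_diversity(c("ACGT", "ACGA", "ACGT"))
print(pi_hat)
```

For the toy input, the three pairwise proportions average to 1/6.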

Merge RLS Survey information with nucleotide diversity

nucdivSpeciesSurveyInfo <- merge_info_nucdiv(nucdivSpecies, reefishSurveyEnvSocio, reefishSurveySpecies)


Grelot/geogendivr documentation built on Sept. 3, 2020, 6:25 p.m.