Datasets provided in the `R` package `ggene`

knitr::knit_hooks$set(mysize = function(before, options, envir) {
  if (before) 
    return(options$size)
})

[//]: mysize=TRUE, size="\small" \newpage

knitr::knit_hooks$set(mysize = function(before, options, envir) {
  if (before) 
    return(options$size)
})

Lists of datasets

ggene objects

The package comes with 8 datasets used to examplify the functions and show how the results can be interpreted. These datasets can be accessed using the function data(name of the dataset). The table below provides a brief description of each dataset. Typing ?name of the dataset will open the dedicated help page.

| name | description | |------|------| | aniso | simulated haploid genotypic dataset exhibiting zonal anistropy | |crypho| haploid microsatellite dataset for the chestnut blight fungus Cryphonectria parasitica | | | [@dutech2008geostatistical] | |larix1350| diploid dataset corresponding to a set of 199 European larch trees Larix decidua | | | sampled at the altitude 1350 m asl [@nardin2015genetic] | |larix2300| diploid dataset corresponding to a set of 175 European larch trees Larix decidua | | | sampled at the altitude 2300 m asl [@nardin2015genetic] | |sim01| simulated haploid genotypic dataset exhibiting clear spatial structure | |sim02| simulated haploid genotypic dataset with no spatial structure | |sim03| a subset of sim01 | |Wagner| toy example used in Wagner et al [-@wagner2005variogram] p. 1750 |

Other data types

Several data files are available in the directory extdata located in the R installation directory. The path to this directory is indicated by the function system.file():

system.file(package="ggene")

These files are stored on the extdata directory and the path is thus given by typing:

system.file("extdata/",package="ggene")

| name | description | |------|------| | sim_01.csv | the file contains the genetic data corresponding | | | to dataset 'sim01' (see above) | | sim_03.gen | a genepop file with 625 individuals | | xysim_01.csv | the file contains the x y coordinates of the individuals in | | | the dataset 'sim01' (see above) |

The code below shows how the files can be loaded and used in R.

library(ggene)
sim <- read.csv(system.file("extdata/sim_01.csv",package="ggene"), 
                header=FALSE)
xy.sim <- read.csv(system.file("extdata/xysim_01.csv",package="ggene"), 
                   header=FALSE)
dat.sim <- tab2geo(X=sim, coord=xy.sim)
class(dat.sim)

Reading genepop files can be achieved using read.genepop from the package adegenet:

library(adegenet)
dat <- read.genepop(system.file("extdata/sim_03.gen",package="ggene"), ncode = 3) 
xy <- read.csv(system.file("extdata/xysim_01.csv",package="ggene"),
               header=FALSE)[1:dim(dat$tab)[1],]
data <- gene2geo(X=dat, coord=xy)
class(data)
str(data)

Information on each dataset

aniso

aniso is a simulated dataset illustrating how variogram can be used to explore spatial anisotropies of a microsatellite dataset. aniso is provided as a ggene object. Anistropy can be explored using directional variograms (function svariog) or variogram maps (function varmap).

Directional variograms are variograms computed for a given set of direction and tolerance [@goovaerts1997geostatistics]. svariog allows a straightforward analysis of anisotropy. First, the omnidirectional variogram is computed using svariog. If the argument plot is set to TRUE, the function plots the variogram.

library(ggene)
data(aniso)
va <- svariog(X=aniso, plot=TRUE)

Each directional variogram corresponds to a direction and a tolerance. We here use 4 directions ranging from 0 to 135 degrees, with a tolerance of 22.5 degrees.

d0_225 <- svariog(X=aniso,direction=0, tolerance=22.5, 
                  unit.angle="degrees")
d45_225 <- svariog(X=aniso,direction=45, tolerance=22.5, 
                   unit.angle="degrees")
d90_225 <- svariog(X=aniso,direction=90, tolerance=22.5, 
                   unit.angle="degrees")
d135_225 <- svariog(X=aniso,direction=135, tolerance=22.5, 
                    unit.angle="degrees")
plot(va$svario$u, va$svario$v, type="b", 
ylim=range(c(va$svario$v, d0_225$svario$v, d45_225$svario$v,
d90_225$svario$v,  d135_225$svario$v)),xlab="distance (m)", 
ylab="semivariance")
points(d0_225$svario$u, d0_225$svario$v, type="b", lty=2)
points(d45_225$svario$u, d45_225$svario$v, type="b", col="red", lty=2)
points(d90_225$svario$u, d90_225$svario$v, type="b", col="blue", lty=2)
points(d135_225$svario$u, d135_225$svario$v, type="b", col="green", lty=2)
legend("topleft", legend=c("omnidirectional", expression(0 * degree), 
expression(45 * degree), expression(90 * degree), expression(135 * degree)), 
lty=c(1,2,2,2,2,2), col=c("black","black","red","blue","green"), bty="n")

It can be seen that the directional variogram for the direction of 45° exhibits a clear departure from the omnidirectional variogram.

Anisotropy can be assessed using another approach, the variogram map or 'variogram surface' (Isaaks & Srivastava [-@isaaks1989applied] p. 149). The parameter cutoff indicates the maximum lag distance to be considered. width is the lag distance increment.

map <- svarmap(X=aniso, cutoff=20, width=1)
plot(map)

As with the directional variograms, the variogram map reveals the presence of a direction for which the semivariance (here gene diversity) is lower (pink areas on the map). This direction is 45°.

crypho

crypho consists of a set of 10 locus for 276 individuals of the chestnut blight fungus \emph{Cryphonectria parasitica} (CBF). This organism is haploid and a certain level of clonality is observed [@dutech2008geostatistical]. This data set is therefore an interesting case study if we are to explore how repeated genetypes affect the observed genetic variation. Following the proposition of Wagner et al [-@wagner2005variogram], we will use a weighting scheme to account for recurrent genotypes.

data(crypho)

The sampling scheme can be examined prior to analyses. The coordinates of the individuals are stored in the $coord slot of the ggene object.

# check sampling scheme
plot(crypho$coord[,1],crypho$coord[,2], xlab="x", ylab="y", asp=1)

The variogram can be computed using the function svariog:

d <- distlag(dist=crypho$coord, dmin=0,distance.lag=50)
va <- svariog(X=crypho, uvec=d, plot=FALSE)
#plot raw variogram
plot(va$svario$u, va$svario$v, col="black", type="b",
      xlab="distance (m)", ylab="semivariance")

Computing the weighted variogram requires the matrix of weights. It is produced by 2 functions, genocount and genoweight.

# compute matrix of weights
count <- genocount(X=crypho)
mat <- genoweight(X=crypho,genotyp=count$vec)

The matrix of weights mat is then used to compute the weighted variogram using the function varioWeight.

wva <- varioWeight(X=crypho, weights=mat,  uvec=d)

We plot the variograms:

plot(wva$svario$u, wva$svario$gamma, col="black", type="b",
     ylim=range(c(wva$svario$gamma,wva$svario$v)), pch=16,
     xlab="distance (m)", ylab="semivariance")
#add the variogram for raw data
points(wva$svario$u, wva$svario$v, col="red", type="b", pch=15,
       lty="dotted")

legend("top", legend=c("raw", "weighted"), col=c("black", "red"),
       pch=c(16,15), lty=c("solid", "dotted"), bty="n")

The weighted variogram is flat which revals that accounting for recurrent genotypes strongly decreased the amount of spatial structure.

An interesting question is to explore to which extent the spatial structure has been removed. For that purpose, we can use the function randsvariog to compute the statistical envelopes for both raw and weighted variograms.

#performs randomization on raw variogram
va <- svariog(X=crypho, plot=FALSE)
env <- randsvariog(var=va, X=crypho, nsim=9, bounds=NULL, 
                   save.sim=FALSE)

#compute the weighted variogram
wva <- varioWeight(X=crypho, weights=mat)

#performs the randomizations on weighted variogram
env2 <- randsvariog(var=wva, X=crypho, nsim=9, bounds=NULL, 
                    save.sim=FALSE, weights=mat)

We plot the results:

# plot results
xx <- c(wva$svario$u, rev(wva$svario$u))
yy <- c(env$env[,1], rev(env$env[,2]))
plot(xx, yy, type = "n", xlab = "distance", ylab = "semivariance", 
    ylim=range(c(env$env[,1], env$env[,2], env2$env[,1], env2$env[,2])))
polygon(xx, yy, col = "lightgrey", border = "black")
xx <- c(wva$svario$u, rev(wva$svario$u))
yy <- c(env2$env[,1], env2$env[,2])
points(xx, yy, type = "l")
polygon(xx, yy, col = "lightblue", border = "blue")

points(wva$svario$u, wva$svario$v, col="blue", typ="b")
points(wva$svario$u, wva$svario$gamma, col="black", type="b", 
    lty="solid", bty="n")

Is the spatial genetic variation of \emph{C. parasitica} isotropic? A simple way to answer the question is to compute the variogram map.

 map <- svarmap(X=crypho,cutoff=500, width=50)
plot(map)

Because some cells may contain only few data pairs, their associated semivariance estimate might be unriable. These cells can be removed from the map using the parameter threshold which default value = 5. Semivariance values derived from a number of data pairs lower than the threshold are not printed on the map.

plot(map, threshold=200)

larix1350

larix1350 corresponds to the microsatellite dataset for 13 microsatellite locus of 189 individuals of European larch (Larix decidua) sampled in an experimental plot at the altitude of 1350 m asl. The survey was undertaken near the village of Villar-Saint-Pancrace (Hautes-Alpes, France) [@nardin2015genetic].

The spatial distribution of the trees can be mapped:

data(larix1350)
plot(larix1350$coord[,1],larix1350$coord[,2], xlab="x", ylab="y", asp=1)

The variogram shows very little structuration, if any.

# compute variogram
va <- svariog(X=larix1350, uvec=distlag(dist=larix1350$coord, dmin=0,
  distance.lag=3), plot=FALSE)
plot(va$svario$u, va$svario$v, xlab="distance (m)", ylab="semivariance")

The statistical envelope indicates that most of the semivariance estimates lie between the 0.025 and 0.975 quantiles. Note that the number of randomizations was kept low in this example while regular data analyses should rely on a much larger number.

# compute statistical envelope
env <- randsvariog(var=va, X=larix1350, nsim=30, bounds=c(0.025, 0.975),
                   save.sim=FALSE)

# plot results
plot(env$svario$u, env$svario$v, ylim=range(env$env), xlab="distance (m)",
     ylab="semivariance")
points(env$svario$u, env$env[,1], type="l")
points(env$svario$u, env$env[,2], type="l")

larix2300

Another dataset for L. decidua collected near the village of Villar-Saint-Pancrace (Hautes-Alpes, France) but this time the sampled population is located at higher alstitude: 2300 m a.s.l [@nardin2015genetic].

data(larix2300)
plot(larix2300$coord[,1],larix2300$coord[,2], xlab="x", ylab="y", asp=1)

The variogram indicates, in this case, a clear spatial genetic structure :

# compute variogram
va <- svariog(X=larix2300, uvec=distlag(dist=larix2300$coord, dmin=0,
  distance.lag=3), plot=FALSE)

# compute statistical envelope
env <- randsvariog(var=va, X=larix2300, nsim=30, bounds=c(0.025, 0.975),
  save.sim=FALSE)

# plot results
plot(env$svario$u, env$svario$v, ylim=range(env$env), xlab="distance (m)",
     ylab="semivariance")
points(env$svario$u, env$env[,1], type="l")
points(env$svario$u, env$env[,2], type="l")

Because the variogram reveals a spatial genetic structure, fitting an exponential model will provide estimates of various parameters synthesizing the SGS at hand. Since the variogram reaches a plateau at ca. 30 m but rises again for distances $>$ 60 m, it is useful to restrain the distance range used in the fitting procedure. This is achieved by setting max.dist to 40 m.

# fit exponential model to the empirical variogram
fit <- fitsvariog(vario=va, ini.cov.pars=c(0.1,20), nugget=0.3, 
  max.dist=30, plot = FALSE)

# plot results
plot(va$svario$u, va$svario$v, xlim=c(0, 40), 
    xlab="distance (m) (m)", ylab="semivariance")
lines(fit$fit)

The estimate of the parameters are given in:

fit$param

The variogram is the sum of the semivariance estimated for each locus. It may be useful, in some cases, to examine if the spatial signal changes according to the locus considered. This is possible with the slot $bylocus returned by the function svariog where the locus-by-locus semivariances are stored.

# compute variogram
va <- svariog(X=larix2300, uvec=distlag(dist=larix2300$coord, dmin=0,
  distance.lag=3), plot=FALSE)

# plot semivariance locus by locus
va <- svariog(X=larix2300)
plot(va$svario$u,va$bylocus[[1]]$gamma.by.locus, xlab="distance (m)",
  ylab="semivariance", type="n", ylim=c(0,0.5))
cols <- rainbow(length(va$bylocus))

for(i in 1:(length(va$bylocus))){
  points(va$svario$u,va$bylocus[[i]]$gamma.by.locus, type="l", col=cols[i])
  }
legend("bottomleft", legend=larix2300$locnames, col=cols, bty="n", lty=1,
  ncol=3)

sim01, sim02 and sim03

These datasets were generated by simulation with the software IBDsim (Leblois et al 2009). - sim01 corresponds to data under SMM generated using the following parameters: 625 gene copies, 20 loci, 25 x 25 haploid individuals evolving at G=0 on a 300 x 300 lattice with absorbing boundaries, mutation proba=0.001, Mrca Moy= 286842, MRCA MAX=617227. - sim02 was constructed by taking the first 100 individuals from sim01 and modifying their spatial coordinates so as to create a clumped distribution and randomizing the genotypes (thus removing the spatial genetic structure). - sim03 was constructed by taking the first 100 individuals from sim01.

data(sim01)
# compute variogram
va <- svariog(X=sim01, uvec=distlag(dist=sim01$coord, dmin=1,
              distance.lag=2), plot=FALSE)
plot(va$svario$u, va$svario$v, xlab="distance (m)", ylab="semivariance")

# fit exponential model to the empirical variogram
fit <- fitsvariog(vario=va, ini.cov.pars=c(0.5,20), nugget=8, max.dist=30,
                  plot = FALSE)
fit$param
lines(fit$fit)
data(sim03)
# compute variogram
va3 <- svariog(X=sim03, uvec=distlag(dist=sim03$coord, dmin=1, 
  distance.lag=2), plot=FALSE)

plot(va$svario$u, va$svario$v, xlab="distance (m)", ylab="semivariance")
points(va3$svario$u, va3$svario$v, col="red")
legend("bottomright", legend=c("625 individuals", "100 individuals"),
  col=c("black", "red"), bty="n", lty=0, pch=1)

This illustrates that the spatial signal becomes more erratic when the number of sampled individuals decreases: sampling effort is a very important issue when it comes to spatial analyses.

data(sim02)
va2 <- svariog(X=sim02, plot=FALSE)
plot(va2$svario$u, va2$svario$v, xlab="distance (m)", ylab="semivariance")

The variogram shows that there is no spatial genetic structure in dataset sim02.

Wagner

This dataset is the toy example used in Wagner et al. [-@wagner2005variogram] p. 1750-1752.

data(Wagner)

count <- genocount(X=Wagner)
count
mat <- genoweight(X=Wagner,genotypes=count$vec)
mat

Individuals 1 and 2 are similar and thus the weight of this couple is 0.

If we compute the variogram for genetic diversity and look at the distance classes, we see that the third distance class corresponding to a lag of 3 distance units is omitted. This is because it involves only one data pair:

wa <- varioWeight(X=Wagner, weights=mat, uvec=c(1,2,3))
wa$svario$u # corresponds distance r in Wagner et al 2005 p 1751
wa$svario$gamma # raw semivariances corrsponding to Hhat(r) 
#in Wagner et al 2005 p 1751
wa$svario$n # corresponds distance nr in Wagner et al 2005 p 1751

Note that varioWeight issues some warning messages because some points are co-located. Finally, note also that wa$svario$v is the weighted semivariance for geneotypic diversity and should not be mistaken for the values reported for the molecular variance in Wagner et al (2005) p. 1752.

Aknowledgements

The INRA department EFPA provided financial support to a prehistoric version of the package ggene. We further benefited from the financial support of the INRA métaprogramme SMaCH (Sustainable Management of Crop Health) through the project COPACABANA.

References



Try the ggene package in your browser

Any scripts or data that you put into this service are public.

ggene documentation built on May 2, 2019, 5:54 p.m.