knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(GeoOrigins)
This vignette walks the reader through the GeoOrigins package using three published datasets. The goal of this vignette is accompany the first manuscript written on this method and also provide the user with the code and tools to use this packages main functionality. The main goal of the method is to provenance specimens of unknown origin to their most likely location of origin without using a priori defined groups. The methods secondary function is to use the leave-one-out correct cross validation procedure to identify trait boundaries.
The vignette is split into three sections to outline the methods and code used to generate the results in the accompanying paper for each of the three datasets used there. The accompanying paper with this method presents the results generated with Pearson's r, examples in sections 1 and 2 here are set to present the results generated with Spearman's rho for comparison; example 3 presents a complex scenario where traits and geographic distances are so highly correlated that the predicted position of the specimen can be smaller than the resolution of the analyses, as such using Pearson's r represents a more efficient assessment of possible location.
The first dataset we will examine comes from 48 specimens of R. praetor.
For this dataset we will work with the raw shape data. Procrustes distances must first be calculated among the specimens use in the ProcDistanceTable
function, which is an extension of the procdist
function from the shapes
package. This function is slow as it needs to calculate the value pairwise
``` {r Calculating Rpraetor Distance matrix} RatDistMat <- ProcDistanceTable(Rpraetor$LMs) colnames(RatDistMat) <- rownames(RatDistMat) <- rownames(Rpraetor$Lat.Long)
With a distance matrix it is now possible to identify the minimum required *r* value to provenance each specimen in the reference dataset. This can be carried out with the `IDbyDistanceDistInputCCV` as follows: ``` {r Calculating correlation threshold for Rpraetor} rThres <- IDbyDistanceDistInputCCV(LatLongs = Rpraetor$Lat.Long, DistDataMat = RatDistMat, Verbose = TRUE, ProvConfidence = .95, Method = "Spearman", PrintProg = FALSE)
To demonstrate the provenancing of one of these specimens in a leave-one-out fashion we can now take this r value and apply it to the normal IDbyDistanceDistInput
function. As this method is largely visually based with a plotted map output, it is first necessary to define the plotting region of the map.
Here we set the region we wish to plot by setting identifying the minimum and maximum range of the reference specimens and then expanding that range by a set value Rang.Exp
. This can be done as follows:
``` {r Setting spatial grid range} Range.Exp <- .5 Long.Range <- c(floor(min(Rpraetor$Lat.Long$Long))-Range.Exp,ceiling(max(Rpraetor$Lat.Long$Long)+Range.Exp)) Lat.Range <- c(floor(min(Rpraetor$Lat.Long$Lat))-Range.Exp,ceiling(max(Rpraetor$Lat.Long$Lat)+Range.Exp))
We can then use this range information in the provenancing function. To run the provenancing function as though it was on a specimen of truly unknown origin we can take the distances to one individual in or reference dataset and run that through the method (remembering to also remove that specimen from the `Lat.Long` information). We also need to define the number of gird points the method should explore along each axis (in the order latitude longitude). As we know the true origin of this specimen we can then plot the true latitude longitude values of that specimen along with the predicted region of provinence (note this process can take some time depending on the level of resolution put into the `RangeSamp` argument): ``` {r Provenancing using a distance input method with Rpraetor} R.Samp <- c(12, 42) IDbyDistanceDistInput(LatLongs = Rpraetor$Lat.Long[-1,], DistDataVec = RatDistMat[-1,1], LongRange = Long.Range, LatRange = Lat.Range, RangeSamp = R.Samp, Verbose = FALSE, Validate = FALSE, PlotValCor = rThres$`Provenancing.Correlation.95%.Confidence`, Method = 'Spearman', PlotProv = TRUE) points(x = Rpraetor$Lat.Long$Long[1], y=Rpraetor$Lat.Long$Lat[1], col='blue', pch=16)
We can also apply this directly to the data by inputting the data for an unknown specimen in the TargetData
argument. If shape data is being used then the function can be set to apply a full Procrustes distance (i.e. pairwise calculated distances where each specimen undergoes a Procrustes rotation, using the proc.dist
function of the shapes
package) by setting the ShapeData
argument to TRUE, providing the number of dimensions (2 or 3) in the ShapeDim
argument and finally setting the DistMethod
argument to "Proc"
. As the landmark data provided here is in array format it first needs to be converted to a matrix for use in this function. This achieves the same result as previously displayed.
`` {r Provenancing using raw shape data input method with Rpraetor}
RpraetorDataMat <- Array2Mat(Rpraetor$LMs)
IDbyDistanceRawData(LatLongs = Rpraetor$Lat.Long[-1,],
TargetData = RpraetorDataMat[1,],
RefData = RpraetorDataMat[-1,],
ShapeData = TRUE,
ShapeDim = 2,
DistMethod = "Proc",
LongRange = Long.Range,
LatRange = Lat.Range,
RangeSamp = R.Samp,
Verbose = FALSE,
Validate = FALSE,
PlotValCor = rThres$
Provenancing.Correlation.95%.Confidence`,
Method = 'Spearman',
PlotProv = TRUE)
points(x = Rpraetor$Lat.Long$Long[1], y=Rpraetor$Lat.Long$Lat[1], col='blue', pch=16)
We can then use the `BoundaryFinder` function to extend this method to identify trait boundaries. This can take some time as it needs to construct a provenancing map for every specimen in the reference dataset and compile the data from that. Note here `IgnorePrompts` is set to TRUE for ease of use in the vignette, but the default is FALSE and will ask you if the datadump location is correct and if the specimen naming sequence is appropriate: ``` {r Trait boundary finding with Rpraetor data} LowResRsamp.Rp <- c(7,20) Boundaryfinding <- BoundaryFinder(LatLongs = Rpraetor$Lat.Long, RefDistMat = RatDistMat, LongRange = Long.Range, LatRange = Lat.Range, RangeSamp = LowResRsamp.Rp, PlotValCor = rThres$`Provenancing.Correlation.95%.Confidence`, Method = 'Spearman', ExpandMap = c(0,0), RefIDs = rownames(Rpraetor$Lat.Long), DataDump = FALSE, IgnorePrompts = TRUE)
The BoundaryFinder
function also outputs the percentage the provenancing region overlaps with the region defined by all the reference collection. This can be examined as a histogram:
``` {r Plotting the identification range percentage with Rpraetor trait boundary results} hist(as.numeric(Boundaryfinding$ProvenanceRange[,2]), breaks=10, xlab = "percentage of distribution returned")
The boundary finding data can be replotted either from the datadump path or from the array generated from the `BoundaryFinder` function if the data has not been dumped. Here for simplicity we will use the previously generated array. The `PlotBoundaries` function allows for changing the colour and plotting of the map. ``` {r Plotting Rpraetor boundary data 1} PlotBoundaries(PlotValCor = rThres$`Provenancing.Correlation.95%.Confidence`, DataDump = FALSE, RawCorArray = Boundaryfinding$RawCorData, plotLong = colnames(Boundaryfinding$RawCorData), plotLat = rownames(Boundaryfinding$RawCorData), LatLongs = Rpraetor$Lat.Long, TileSize = 1, BoundaryHue = c(1,1))
Here the TileSize
setting doesn't match up with the resolution of the grid sampling that was used for generating this trait boundary object. Here we can adjust the TileSize setting to attempt to account for this. We can also change the BoundaryHue
argument to change the colour of the plotted boundaries.
`` {r Plotting Rpraetor boundary data 2}
PlotBoundaries(PlotValCor = rThres$
Provenancing.Correlation.95%.Confidence`,
DataDump = FALSE,
RawCorArray = Boundaryfinding$RawCorData,
plotLong = colnames(Boundaryfinding$RawCorData),
plotLat = rownames(Boundaryfinding$RawCorData),
LatLongs = Rpraetor$Lat.Long,
TileSize = 3,
BoundaryHue = c(0.5,1))
*Microtus arvalis* the geographic structure of the dental morphology of Common and Orkney voles ----------------------------------------------------------------------------------------------- The second dataset we will examine comes from 553 specimens of *M. arvalis*. For this dataset we will work with pregenerated boundary correlation arrays. The original landmark is available from the original publication Cucchi et al 2014, but a procrustes distances matrix is also provided here for the r threshold calculation. Due to pakage dataset size limitations the following trait boundary correlation arrays provided here are at a lower resolution (i.e. the spatial grid has been sampled at a lower level by setting the `RangeSamp` argument to a lower values) than those used to build the figures in the main body of the paper. The level of resolution used in the publication can be achieved by setting the `RangeSamp` value to `c(30,36)` for the full species range and `c(30,20)` for the Orkney only range. ``` {r Calculating correlation threshold and plotting trait boundaries for Marvalis} MarvalisrThres <- IDbyDistanceDistInputCCV(LatLongs = Marvalis$Info[,3:4], DistDataMat = Marvalis$Proc.Dist, Verbose = TRUE, ProvConfidence = .95, Method = 'Spearman', PrintProg = FALSE) PlotBoundaries(PlotValCor = MarvalisrThres$`Provenancing.Correlation.95%.Confidence`, DataDump = FALSE, RawCorArray = Marvalis$Total.Boundary$RawCorData, plotLong = colnames(Marvalis$Total.Boundary$RawCorData), plotLat = rownames(Marvalis$Total.Boundary$RawCorData), LatLongs = Marvalis$Info[,3:4], TileSize = 2.5, BoundaryHue = c(0.5,1))
When applying it to just the Orkney range we first need to select out just the Orkney specimens. This can be done with the information provided in the Marvalis$Info
object.
`` {r Plotting trait boundaries for Orkney Marvalis}
OrkneySpecimens <- which(Marvalis$Info$Geographic.Division=='Orkney Is.')
OrkMarvalisrThres <- IDbyDistanceDistInputCCV(LatLongs = Marvalis$Info[OrkneySpecimens,3:4],
DistDataMat = Marvalis$Proc.Dist[OrkneySpecimens,OrkneySpecimens],
Verbose = TRUE,
ProvConfidence = .95,
Method = 'Spearman',
PrintProg = FALSE)
PlotBoundaries(PlotValCor = OrkMarvalisrThres$
Provenancing.Correlation.95%.Confidence`,
DataDump = FALSE,
RawCorArray = Marvalis$Orkney.Boundary$RawCorData,
plotLong = colnames(Marvalis$Orkney.Boundary$RawCorData),
plotLat = rownames(Marvalis$Orkney.Boundary$RawCorData),
LatLongs = Marvalis$Info[OrkneySpecimens,3:4],
TileSize = 2.1,
BoundaryHue = c(0.5,1))
*Fringilla teydea* the geographic structure of the songs of Tenerifian blue chaffinch ------------------------------------------------------------------------------------- The third dataset we will examine comes from 116 specimens of *F. teydea*. Again we will work with pre generated boundary correlation arrays and the dynamic time-warping dissimilarity matrix, which serves as a distance matrix. As with the other datasets provided this is at a lower resolution than provided in the figures of the paper and to achieve the same images requires upping the `Rang.Samp` values to `c(36, 34)`. Note that this difference in resolution can influence the position of the trait boundary and at very low resolutions (as is the case here) the boundary will inevitably move to one side or the other of the original range. This is observable in the NW population and corresponding boundary. Therefore it might be desirable to run a first low resolution pass followed by a higher resolution investigation afterwards. ``` {r Calculating correlation threshold and plotting trait boundaries for Fteydea} FteydeaThres <- IDbyDistanceDistInputCCV(LatLongs = Fteydea$Info[,2:3], DistDataMat = sqrt(Fteydea$SongDisMat), Verbose = TRUE, ProvConfidence = .95, Method = 'Spearman', PrintProg = FALSE) PlotBoundaries(PlotValCor = FteydeaThres[[1]], DataDump = FALSE, RawCorArray = Fteydea$Total.Boundary$RawCorData, plotLong = colnames(Fteydea$Total.Boundary$RawCorData), plotLat = rownames(Fteydea$Total.Boundary$RawCorData), LatLongs = Fteydea$Info[,2:3], TileSize = 2.5, BoundaryHue = c(0.5,1))
A possible occurrence with this method is that when the predicted location is highly precise (e.g. the results that can be achieved by log transforming the F. teydea song dissimilarities) then the resolution of the provenancing or even boundary finding grid might be too low to highlight the regions that might be the likely origin or indeed form a trait boundary. This happens because the region that is identified as the likely origin is effectively one grid squares/pixels in size or even smaller. The end result is that the correlation value for the the large grid square may not meet the high correlation value threshold and as a result no region is returned as a likely origin. In some instances it is possible for this to be confused with a negative result and could be misinterpreted as the unknown trait not being sufficiently represented in the reference dataset to predict a likely origin. An example of this is provided when examining the NE sub region of the blue chaffinch data. The identified regions in some instances are so specific that at low resolution a predicted location is not returned.
Note that log transforming the dissimilarity matrix results in the diagonal of 0s (where the dissimilarity between the same individual is reported) becomes -Inf
so this should be replaced with 0s again.
Note also that this dataset is so highly spatially correlated that setting the Threshold R with Spearman will require the grid sampling to be prohibitively high. Therefore for the purposes of this vignette example we will go back to using Pearon's method as in the accompanying paper. This issue can be easily identified by examining the output of IDbyDistanceDistInputCCV to see what the threshold rho is; if it is close to 1 then consider one of the following: increasing the confidence required to identify the region of identification; increasing the grid sampling; or user defining the rho threshold and ensure that a description of that this was carried out is included in your description of methods.
``` {r Examples of working with fine resolution and potential issues with Fteydea} LogFteydeaDisMat <- log(Fteydea$SongDisMat) LogFteydeaDisMat[which(is.infinite(LogFteydeaDisMat))] <- 0
NE <- which(Fteydea$Info$Lat>28.4)
NEFtInfo <- Fteydea$Info[NE,]
NEDisMat <- LogFteydeaDisMat[NE,NE]
NEFtrThres <- IDbyDistanceDistInputCCV(LatLongs = NEFtInfo[,2:3],
DistDataMat = NEDisMat,
Verbose = TRUE,
ProvConfidence = .99,
PrintProg = FALSE,
Method = 'Pearson')
Long.Range.Exp <- .005
Lat.Range.Exp <- .001
NEFt.Long.Range <- c(min(NEFtInfo$Long)-Long.Range.Exp, max(NEFtInfo$Long)+Long.Range.Exp)
NEFt.Lat.Range <- c(min(NEFtInfo$Lat)-Lat.Range.Exp, max(NEFtInfo$Lat)+Lat.Range.Exp)
NEFtR.Samp <- c(5)
IDbyDistanceDistInput(LatLongs = NEFtInfo[-1,2:3],
DistDataVec = NEDisMat[-1,1],
LongRange = NEFt.Long.Range,
LatRange = NEFt.Lat.Range,
RangeSamp = NEFtR.Samp,
Verbose = FALSE,
Validate = FALSE,
PlotValCor = NEFtrThres$Provenancing.Correlation.99%.Confidence
,
PlotProv = TRUE,
Method = 'Pearson')
points(x = NEFtInfo$Long[1], y=NEFtInfo$Lat[1], col='blue', pch=16)
In comparison if we increase the `RangeSamp` value such that a number of grid squares reach the r threshold but not enough to sufficiently return a polygon of a likely origin region the function returns a map with those grid squares highlighted and the following warning message: ``` {r Addressing potential issues with high precision datasets with examples from Fteydea 1} NEFtR.Samp <- c(15) IDbyDistanceDistInput(LatLongs = NEFtInfo[-1,2:3], DistDataVec = NEDisMat[-1,1], LongRange = NEFt.Long.Range, LatRange = NEFt.Lat.Range, RangeSamp = NEFtR.Samp, Verbose = FALSE, Validate = FALSE, PlotValCor = NEFtrThres$`Provenancing.Correlation.99%.Confidence`, PlotProv = TRUE, Method = 'Pearson') points(x = NEFtInfo$Long[1], y=NEFtInfo$Lat[1], col='blue', pch=16)
If we increase the grid sampling further the function provides a polygon of the likely region, but this too will be influenced by sampling resolution:
`` {r Addressing potential issues with high precision datasets with examples from Fteydea 2}
NEFtR.Samp <- c(25)
IDbyDistanceDistInput(LatLongs = NEFtInfo[-1,2:3],
DistDataVec = NEDisMat[-1,1],
LongRange = NEFt.Long.Range,
LatRange = NEFt.Lat.Range,
RangeSamp = NEFtR.Samp,
Verbose = FALSE,
Validate = FALSE,
PlotValCor = NEFtrThres$
Provenancing.Correlation.99%.Confidence`,
PlotProv = TRUE,
Method = 'Pearson')
points(x = NEFtInfo$Long[1], y=NEFtInfo$Lat[1], col='blue', pch=16)
If we increase the grid sampling further again the function provides a more accurate polygon of the likely region (Note however that computation time increases and can become prohibitive): ``` {r Addressing potential issues with high precision datasets with examples from Fteydea 3} NEFtR.Samp <- c(50) IDbyDistanceDistInput(LatLongs = NEFtInfo[-1,2:3], DistDataVec = NEDisMat[-1,1], LongRange = NEFt.Long.Range, LatRange = NEFt.Lat.Range, RangeSamp = NEFtR.Samp, Verbose = FALSE, Validate = FALSE, PlotValCor = NEFtrThres[[1]], PlotProv = TRUE, Method = 'Pearson') points(x = NEFtInfo$Long[1], y=NEFtInfo$Lat[1], col='blue', pch=16)
This issue can also occur when using the BoundaryFinder
function. If we set the grid sampling too low and the likely origin location for some reference specimens is poorly defined then there is no correlation data from which a boundary can be identified. However, also note that if the results are extremely precise and individuals are identified to a very specific region then there may not be enough overlap in boundaries to confidently call a trait boundary; this is effectively because the level of precision is so high that each individual has its own boundary and a general trend cannot be extrapolated.
Also note that with the BoundaryFinder
function the overlap between the predicted origin and the reference dataset distribution is returned, but at low resolutions there is no polygon to calculate this value from so no percentage is returned. At low resolution it is also possible that there may be no overlap between these two polygons (e.g. if the only grid square that does meet the threshold falls outside the reference dataset distribution) and as a result a warning will be generated saying the polygons do not intersect. See the below examples:
`` {r Addressing potential issues with high precision datasets with examples from Fteydea 4}
NEFtR.Samp <- c(5)
NEFt.Boundaryfinding <- BoundaryFinder(LatLongs = NEFtInfo[,2:3],
RefDistMat = NEDisMat,
DataDump = FALSE,
LongRange = NEFt.Long.Range,
LatRange = NEFt.Lat.Range,
RangeSamp = NEFtR.Samp,
PlotValCor = NEFtrThres$
Provenancing.Correlation.99%.Confidence`,
RefIDs = NEFtInfo$ID,
IgnorePrompts = TRUE,
Method = 'Pearson')
And again with a higher resolution: ``` {r Addressing potential issues with high precision datasets with examples from Fteydea 5} NEFtR.Samp <- c(20,30) NEFt.MidRes.Boundaryfinding <- BoundaryFinder(LatLongs = NEFtInfo[,2:3], RefDistMat = NEDisMat, DataDump = FALSE, LongRange = NEFt.Long.Range, LatRange = NEFt.Lat.Range, RangeSamp = NEFtR.Samp, PlotValCor = NEFtrThres$`Provenancing.Correlation.99%.Confidence`, RefIDs = NEFtInfo$ID, IgnorePrompts = TRUE, Method = 'Pearson')
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.