knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(fishCompTools)
This document will walk through how to estimate composition of a mixture of all ad-intact fish by PBT and GSI groups using maximum likelihood estimation. if you do not separate wild origin fish by GSI assignments, you can create a dummy variable that gives all fish (including PBT-assigned fish) the same GSI assignment.
A scenario of all ad-intact fish may be encountered for many reasons, including:
This document will work through a scenario where the samples are grouped into three strata.
First, we will generate some data using the built in data-simulation function.
set.seed(130) #make generated data reproducible # define the unknown "true" proportions at which fish in each # PBT group assign to the different GSI groups pbtGSImat <- matrix(c(.1, .8, .1,.8, .1, .1,.1, .1, .8), nrow = 3, ncol = 3, byrow = TRUE) # now generate three strata from identical parameters multStratData <- data.frame() for(i in 1:3){ tempData <- generatePBTGSIdata(sampRate = .2, censusSize = 3000, relSizePBTgroups = c(1,2,3), tagRates = c(.8, .85,.9), physTagRates = 0, true_clipped = 0, true_noclip_H = .4, true_wild = .6, relSizeGSIgroups = c(1,2,1), PBT_GSI_calls = pbtGSImat) trapData <- tempData[[1]] trapData$Strata <- i multStratData <- rbind(multStratData, trapData) } tags <- tempData[[2]] # get the tag rates for the PBT groups multStratData$GSI <- paste0("GSIgroup_", multStratData$GSI) head(multStratData)
Now we have a dataframe named multStratData
with a column describing whether a fish is ad-clipped or ad-intact (values of AD and AI, respectively), a column denoting if the fish has a physical tag that identifies it as hatchery origin (ignore this column for this vignette, the data we generated has no physical tags), a column with the PBT assignment (Unassigned means it was attempted to be assigned, but did not have a match), a column with GSI assignments, and a column giving the strata that sample belongs to.
head(tags)
We also have a dataframe named tags
with the names of the PBT groups (matching the labels in the GenParentHatchery column) and the tag rates (successful genotyping rates) for those groups.
The likelihood function is defined in the package, and a wrapper in the package calls R's built in optim
function to maximize the likelihood of the parameters within each strata. There is also a gradient function for this likelihood to facilitate usage of gradient based methods. I show here my recommended way of running the optimization. It may occasionally cause R to throw an error, in which case you can troubleshoot (adjusting some of the option for optim can sometimes help), or use the Nelder-Mead
method, which probably won't give you as good of an estimate, but is more robust. You could also consider using the accounting approach (scobi_deux
) or the Bayesian approach (MCpbt
) defined in this package.
#the control option is passed to optim, telling it to run up to 10000 iterations #the gr option is passed to optim, telling it to use params_grad as the gradient function #the lower option is passed to optim, telling it to use 10^-12 as the minimum value to try for any parameter mlePointEstimates <- MLEwrapper(multStratData, tags = tags, GSIcol = "GSI", PBTcol = "GenParentHatchery", strataCol = "Strata", adFinCol = "AdClip", AI = TRUE, optimMethod = "L-BFGS-B", variableCols = NULL, gr = params_grad, lower=10^-12, control = list(maxit=10000))
Now we have the point estimates of composition for each strata. The variable mlePointEstimates
is a list with one item for each strata, and the items have identical structure. Let's look at one strata and walk through the structure of the results.
print(mlePointEstimates[[1]])
You can see each strata is itself a list, with four items. The piTot
item is a vector that gives the estimated proportions of the mixture that are hatchery fish belonging to the three PBT groups and that are wild fish belonging to the three GSI groups.
The piGSI
item is a matrix that shows the estimated proportions of each group (rows) by observed GSI assignment (columns). The wild origin groups are all fixed with 100% of the group assigning to the given GSI category.
If we specified the column name of a categorical variable when we called MLEwrapper
, the piVar
item would be a matrix that has the estimated proportions of each group (rows) by the different categories of the variable (columns). Since we did not specify a variable, this entry is a blank matrix.
The strataName
item is the value of the name of the strata that this entry was estimated from.
Now we have the estimates as proportions, if we have an estimate of the total population size for each strata, we can simply multiply the size by piTot
for each strata, and this will yield estimates for the number of fish in each category within each strata. To obtain an estimate across all strata, we can then simply sum them up as appropriate.
For example, if we estimated that there were 3000 fish in strata 1 and 2, but only 1500 fish in strata three, we can calculate the totals in each strata with the following:
#let's use rbind to make a matrix holding our estimates for each strata # each row will be a strata # note that if different groups are present in different strata, # or if the order of the groups in piTot is different in different strata, the below code will # not work. This is written to explain, not be efficient or perfectly versatile. strataComp <- rbind( mlePointEstimates[[1]]$piTot * 3000, mlePointEstimates[[2]]$piTot * 3000, mlePointEstimates[[3]]$piTot * 1500 ) print(strataComp)
Now, we have the number of fish in each strata, and to get the number of fish in each category overall, we just have to add the columns.
colSums(strataComp)
To get confidence intervals for these estimates, we can settle for non-parametric bootstrapping. This is computationally intensive, so it may take a while depending on how much data you have. We will set up a simple for
loop to perform the bootstrapping, and we will save the estiamtes from each iteration. In this example, we are only interested in piTot
, but if you were interested in the other estimates, you would want to save them as well. We are also only going to generate confidence intervals for all strata combined, but you could save the results for each strata separately if you wanted a confidence interval for each strata.
# the number of bootstraps we will perform bootRep <- 50 # NOTE: this number is set low for the vignette to run quickly. You will want to perform many more repetitions. # a data structure to save the estimates #each row is an estimate, each column is a group bootPiTot <- matrix(0, nrow = bootRep, ncol = 6) colnames(bootPiTot) <- colnames(strataComp) #pulling the group names from strataComp #separate data by strata dataByStrata <- list() for(s in unique(multStratData$Strata)){ dataByStrata[[as.character(s)]] <- multStratData[multStratData$Strata == s,] } for(b in 1:bootRep){ #resample each strata separately bootData <- data.frame() for(d in dataByStrata){ bootData <- rbind(bootData, d[sample(1:nrow(d), nrow(d), replace = TRUE),]) } #get estimates for each strata with resampled data bootEst <- MLEwrapper(bootData, tags = tags, GSIcol = "GSI", PBTcol = "GenParentHatchery", strataCol = "Strata", adFinCol = "AdClip", AI = TRUE, optimMethod = "L-BFGS-B", variableCols = NULL, gr = params_grad, lower=10^-12, control = list(maxit=10000)) #now multiply be population size bootStrataComp <- list( bootEst[[1]]$piTot * 3000, bootEst[[2]]$piTot * 3000, bootEst[[3]]$piTot * 1500 ) #now sum up across strata and add to bootPiTot for(g in colnames(bootPiTot)){ tempSum <- 0 for(strat in bootStrataComp){ # removign NA in case group does not exist in this resample tempSum <- sum(tempSum, strat[g], na.rm = TRUE) } bootPiTot[b,g] <- tempSum } } #now, let's calculate a 90% CI apply(bootPiTot, 2, quantile, c(.05, .95))
Let's say you also want to determine that composition of each group according to one particular categorical variable. Suppose you want to estiamte the proportions of each group that are female.
First, we get some data, following the same scenario as above:
set.seed(13) #make generated data reproducible # define the unknown "true" proportions at which fish in each # PBT group assign to the different GSI groups pbtGSImat <- matrix(c(.1, .8, .1,.8, .1, .1,.1, .1, .8), nrow = 3, ncol = 3, byrow = TRUE) #define the "true" proportions of each group that are male and female # we will have a moderate difference between the PBT and wild groups varMat <- list( matrix(c(rep(c(.4,.6), 3), rep(c(.6,.4), 3)), nrow = 6, ncol = 2, byrow = TRUE) ) # now generate three strata from identical parameters multStratData <- data.frame() for(i in 1:3){ tempData <- generatePBTGSIdata(sampRate = .2, censusSize = 3000, relSizePBTgroups = c(1,2,3), tagRates = c(.8, .85,.9), physTagRates = 0, true_clipped = 0, true_noclip_H = .4, true_wild = .6, relSizeGSIgroups = c(1,2,1), PBT_GSI_calls = pbtGSImat, varMatList = varMat) trapData <- tempData[[1]] trapData$Strata <- i multStratData <- rbind(multStratData, trapData) } tags <- tempData[[2]] # get the tag rates for the PBT groups multStratData$GSI <- paste0("GSIgroup_", multStratData$GSI) multStratData$Var1[multStratData$Var1 == 1] <- "F" multStratData$Var1[multStratData$Var1 == 2] <- "M" colnames(multStratData)[colnames(multStratData) == "Var1"] <- "Sex" head(multStratData)
Then, we use the MLEwrapper
function as before, but we specify the column name of the categorical variable with the variableCols =
option. Notice that here we are using a different gradient function than before.
#the control option is passed to optim, telling it to run up to 10000 iterations #the gr option is passed to optim, telling it to use params_grad as the gradient function #the lower option is passed to optim, telling it to use 10^-12 as the minimum value to try for any parameter withVarEstimates <- MLEwrapper(multStratData, tags = tags, GSIcol = "GSI", PBTcol = "GenParentHatchery", strataCol = "Strata", adFinCol = "AdClip", AI = TRUE, optimMethod = "L-BFGS-B", variableCols = "Sex", gr=params_grad_var, lower = 10^-12, control = list(maxit=10000)) ## now look at the first strata print(withVarEstimates[[1]])
Now, we have the estimated the proportions male and female for each group and each strata as the piVar
items.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.