make.mark.model: Create a MARK model for analysis

Description Usage Arguments Details Value Author(s) See Also Examples

View source: R/make.mark.model.R

Description

Creates a MARK model object that contains a MARK input file with PIMS and design matrix specific to the data and model structure and formulae specified for the model parameters.

Usage

1
2
3
4
5
6
7
make.mark.model(data, ddl, parameters = list(), title = "",
  model.name = NULL, initial = NULL, call = NULL,
  default.fixed = TRUE, options = NULL, profile.int = FALSE,
  chat = NULL, simplify = TRUE, input.links = NULL,
  parm.specific = FALSE, mlogit0 = FALSE, hessian = FALSE,
  accumulate = TRUE, icvalues = NULL, wrap = TRUE, nodes = 101,
  useddl = FALSE, check.model = FALSE)

Arguments

data

Data list resulting from function process.data

ddl

Design data list from function make.design.data

parameters

List of parameter formula specifications

title

Title for the analysis (optional)

model.name

Model name to override default name (optional)

initial

Vector of named or unnamed initial values for beta parameters or previously run model (optional)

call

Pass function call when this function is called from another function (e.g.mark) (internal use)

default.fixed

if TRUE, real parameters for which the design data have been deleted are fixed to default values

options

character string of options for Proc Estimate statement in MARK .inp file

profile.int

if TRUE, requests MARK to compute profile intervals

chat

pre-specified value for chat used by MARK in calculations of model output

simplify

if FALSE, does not simplify PIM structure

input.links

specifies set of link functions for parameters with non-simplified structure

parm.specific

if TRUE, forces a link to be specified for each parameter

mlogit0

if TRUE, any real parameter that is fixed to 0 and has an mlogit link will have its link changed to logit so it can be simplified

hessian

if TRUE specifies to MARK to use hessian rather than second partial matrix

accumulate

if TRUE accumulate like data values into frequencies

icvalues

numeric vector of individual covariate values for computation of real values

wrap

if TRUE, data lines are wrapped to be length 80; if length of a row is not a problem set to FALSE and it will run faster

nodes

number of integration nodes for individual random effects (min 15, max 505, default 101)

useddl

If TRUE and there are no missing rows or parameters (deleted) then it will use ddl in place of full.ddl that is created internally.

check.model

if TRUE, code does an internal consistency check between PIMs and design data when making model.

Details

This function is called by mark to create the model but it can be called directly to create but not run the model. All of the arguments have default values except for the first 2 that specify the processed data list (data) and the design data list (ddl). If only these 2 arguments are specified default models are used for the parameters. For example, following with the example from process.data and make.design.data, the default model can be created with:

mymodel=make.mark.model(proc.example.data,ddl)

The call to make.mark.model creates a model object but does not do the analysis. The function returns a list that includes several fields including a design matrix and the MARK input file that will be used with MARK.EXE to run the analysis from function run.mark.model. The following shows the names of the list elements in mymodel:

1
2
3
4
 names(mymodel) [1] "data" "model" "title" "model.name"
"links" [6] "mixtures" "call" "parameters" "input" "number.of.groups" [11]
"group.labels" "nocc" "begin.time" "covariates" "fixed" [16] "design.matrix"
"pims" "design.data" "strata.labels" "mlogit.list" [21] "simplify" 

The list is defined to be a mark object which is done by assigning a class vector to the list. The classes for an R object can be viewed with the class function as shown below:

1
 class(mymodel) [1] "mark" "CJS" 

Each MARK model has 2 class values. The first identifies it as a mark object and the second identifies the type of mark analysis, which is the default "CJS" (recaptures only) in this case. The use of the class feature has advantages in using generic functions and identifying types of objects. An object of class mark is defined in more detail in function mark.

To fit non-trivial models it is necessary to understand the remaining calling arguments of make.mark.model and R formula notation. The model formulae are specified with the calling argument parameters. It uses a similar list structure as the parameters argument in make.design.data. It expects to get a list with elements named to match the parameters in the particular analysis (e.g., Phi and p in CJS) and each list element is a list, so it is a list of lists). For each parameter, the possible list elements are formula, link, fixed, component, component.name, remove.intercept. In addition, for closed capture models and robust design model, the element share is included in the list for p (capture probabilities) and GammaDoublePrime (respectively) to indicate whether the model is shared (share=TRUE) or not-shared (the default) (share=FALSE) with c (recapture probabilities) and GammaPrime respectively.

formula specifies the model for the parameter using R formula notation. An R formula is denoted with a ~ followed by variables in an equation format possibly using the + , *, and : operators. For example, ~sex+age is an additive model with the main effects of sex and age. Whereas, ~sex*age includes the main effects and the interaction and it is equivalent to the formula specified as ~sex+age+sex:age where sex:age is the interaction term. The model ~age+sex:age is slightly different in that the main effect for sex is dropped which means that intercept of the age effect is common among the sexes but the age pattern could vary between the sexes. The model ~sex*Age which is equivalent to ~sex + Age + sex:Age has only 4 parameters and specifies a linear trend with age and both the intercept and slope vary by sex. One additional operator that can be useful is I() which allows computation of variables on the fly. For example, the addition of the Agesq variable in the design data (as described above) can be avoided by using the notation ~Age + I(Age^2) which specifies use of a linear and quadratic effect for age. Note that specifying the model ~age + I(age^2) would be incorrect and would create an error because age is a factor variable whereas Age is not.

As an example, consider developing a model in which Phi varies by age and p follows a linear trend over time. This model could be specified and run as follows:

1
2
3
4
 p.Time=list(formula=~Time) Phi.age=list(formula=~age)
Model.p.Time.Phi.age=make.mark.model(proc.example.data,ddl,
parameters=list(Phi=Phi.age,p=p.Time))
Model.p.Time.Phi.age=run.mark.model(Model.p.Time.Phi.age) 

The first 2 commands define the p and Phi models that are used in the parameter list in the call to make.mark.model. This is a good approach for defining models because it clearly documents the models, the definitions can then be used in many calls to make.mark.model and it will allow a variety of models to be developed quickly by creating different combinations of the parameter models. Using the notation above with the period separating the parameter name and the description (eg., p.Time) gives the further advantage that all possible models can be developed quickly with the functions create.model.list and mark.wrapper.

Model formula can use any field in the design data and any individual covariates defined in data. The restrictions on individual covariates that was present in versions before 1.3 have now been removed. You can now use interactions of individual covariates with all design data covariates and products of individual covariates. You can specify interactions of individual covariates and factor variables in the design data with the formula notation. For example, ~region*x1 describes a model in which the slope of x1 varies by region. Also, ~time*x1 describes a model in which the slope for x1 varied by time; however, there would be only one value of the covariate per animal so this is not a time varying covariate model. Models with time varying covariates are now more easily described with the improvements in version 1.3 but there are restrictions on how the time varying individual covariates are named. An example would be a trap dependence model in which capture probability on occasion i+1 depends on whether they were captured on occasion i. If there are n occasions in a CJS model, the 0/1 (not caught/caught) for occasions 1 to n-1 would be n-1 individual covariates to describe recapture probability for occasions 2 to n. For times 2 to n, a design data field could be defined such that the variable timex is 1 if time==x and 0 otherwise. The time varying covariates must be named with a time suffix on the base name of the covariate. In this example they would be named as x2,. . .,xn and the model could be specified as ~time + x for time variation and a constant trap effect or as ~time + time:x for a trap effect that varied by time. If in the process.data call, the argument begin.time was set to the year 1990, then the variables would have to be named x1991,x1992,... because the first recapture occasion would be 1991. Note that the times are different for different parameters. For example, survival is labeled based on the beginning of the time interval which would be 1990 so the variables should be named appropriately for the parameter model in which they will be used.

In previous versions to handle time varying covariates with a constant effect, it was necessary to use the component feature of the parameter specification to be able to append one or more arbitrary columns to the design matrix. That is no longer required for individual covariates and the component feature was removed in v2.0.8.

There are three other elements of the parameter list that can be useful on occasion. The first is link which specifies the link function for transforming between the beta and real parameters. The possible values are "logit", "log", "identity" and "mlogit(*)" where * is a numeric identifier. The "sin" link is not allowed because all models are specified using a design matrix. The typical default values are assigned to each parameter (eg "logit" for probabilities, "log" for N, and "mlogit" for pent in POPAN), so in most cases it will not be necessary to specify a link function.

The second is fixed which allows real parameters to be set at fixed values. The values for fixed can be either a single value or a list with 5 alternate forms for ease in specifying the fixed parameters. Specifying fixed=value will set all parameters of that type to the specified value. For example, Phi=list(fixed=1) will set all Phi to 1. This can be useful for situations like specifying F in the Burnham/Barker models to all have the same value of 1. Fixed values can also be specified as a list in which values are specified for certain indices, times, ages, cohorts, and groups. The first 3 will be the most useful. The first list format is the most general and flexible but it requires an understanding of the PIM structure and index numbers for the real parameters. For example,

Phi=list(formula=~time, fixed=list(index=c(1,4,7),value=1))

specifies Phi varying by time, but the real parameters 1,4,7 are set to 1. The value field is either a single constant or its length must match the number of indices. For example,

Phi=list(formula=~time, fixed=list(index=c(1,4,7),value=c(1,0,1)))

sets real parameters 1 and 7 to 1 and real parameter 4 to 0. Technically, the index/value format for fixed is not wedded to the parameter type (i.e., values for p can be assigned within Phi list), but for the sake of clarity they should be restricted to fixing real parameters associated with the particular parameter type. The time and age formats for fixed will probably be the most useful. The format fixed=list(time=x, value=y) will set all real parameters (of that type) for time x to value y. For example,

p=list(formula=~time,fixed=list(time=1986,value=1))

sets up time varying capture probability but all values of p for 1986 are set to 1. This can be quite useful to set all values to 0 in years with no sampling (e.g.,

1
fixed=list(time=c(1982,1984,1986), value=0)

). The age, cohort and group formats work in a similar fashion. It is important to recognize that the value you specify for time, age, cohort and group must match the values in the design data list. This is another reason to add binned fields for age, time etc with add.design.data after creating the default design data with make.design.data. Also note that the values for time and cohort are affected by the begin.time argument specified in process.data. Had I not specified begin.time=1980, to set p in the last occasion (1986), the specification would be

p=list(formula=~time,fixed=list(time=7,value=1))

because begin.time defaults to 1. The advantage of the time-, age-, and cohort- formats over the index-format is that it will work regardless of the group definition which can easily be changed by changing the groups argument in process.data. The index-format will be dependent on the group structure because the indexing of the PIMS will most likely change with changes in the group structure.

Parameters can also be fixed at default values by deleting the specific rows of the design data. See make.design.data and material below. The default value for fixing parameters for deleted design data can be changed with the default=value in the parameter list.

Another useful element of the parameter list is the remove.intercept argument. It is set to TRUE to forcefully remove the intercept. In R notation this can be done by specifiying the formula notation ~-1+... but in formula with nested interactions of factor variables and additive factor variables the -1 notation will not remove the intercept. It will simply adjust the column definitions but will keep the same number of columns and the model will be overparameterized. The problem occurs with nested factor variables like tostratum within stratum for multistrata designs (see mstrata). As shown in that example, you can build a formula -1+stratum:tostratum to have transitions that are stratum-specific. If however you also want to add a sex effect and you specify -1+sex+stratum:tostratum it will add 2 columns for sex labelled M and F when in fact you only want to add one column because the intercept is already contained within the stratum:tostratum term. The argument remove.intercept will forcefully remove the intercept but it needs to be able to find a column with all 1's. For example, Psi=list(formula=~sex+stratum:tostratum,remove.intercept=TRUE) will work but Psi=list(formula=~-1+sex+stratum:tostratum,remove.intercept=TRUE) will not work. Also, the -1 notation should be used when there is not an added factor variable because

1
Psi=list(formula=~stratum:tostratum,remove.intercept=TRUE)

will not work because while the stratum:tostratum effectively includes an intercept it is equivalent to using an identity matrix and is not specified as treatment contrast with one of the columns as all 1's.

The final argument of the parameter list is contrast which can be used to change the contrast used for model.matrix. It uses the default if none specified. The form is shown in ?model.matrix.

The argument simplify determines whether the pims are simplified such that only indices for unique and fixed real parameters are used. For example, with an all different PIM structure with CJS with K occasions there are K*(K-1) real parameters for Phi and p. However, if you use simplify=TRUE with the default model of Phi(.)p(.), the pims are re-indexed to be 1 for all the Phi's and 2 for all the p's because there are only 2 unique real parameters for that model. Using simplify can speed analysis markedly and probably should always be used. This was left as an argument only to test that the simplification was working and produced the same likelihood and real parameter estimates with and without simplification. It only adjust the rows of the design matrix and not the columns. There are some restrictions for simplification. Real parameters that are given a fixed value are maintained in the design matrix although it does simplify amongst the fixed parameters. For example, if there are 50 real parameters all fixed to a value of 1 and 30 all fixed to a value of 0, they are reduced to 2 real parameters fixed to 1 and 0. Also, real parameters such as Psi in Multistrata and pent in POPAN that use multinomial logits are not simplified because they must maintain the structure created by the multinomial logit link. All other parameters in those models are simplified. The only downside of simplification is that the labels for real parameters in the MARK output are unreliable because there is no longer a single label for the real parameter because it may represent many different real parameters in the all-different PIM structure. This is not a problem with the labels in R because the real parameter estimates are translated back to the all-different PIM structure with the proper labels.

The argument default.fixed is related to deletion of design data (see make.design.data). If design data are deleted and default.fixed=T the missing real parameters are fixed at a reasonable default to represent structural "zeros". For example, p is set to 0, Phi is set to 1, pent is set to 0, etc. For some parameters there are no reasonable values (e.g., N in POPAN), so not all parameters will have defaults. If a parameter does not have a default or if default.fixed=F then the row in the design matrix for that parameter is all zeroes and its real value will depend on the link function. For example, with "logit" link the real parameter value will be 0.5 and for the log link it will be 1. As long as the inverse link is defined for 0 it will not matter in those cases in which the real parameter is not used because it represents data that do not exist. For example, in a "CJS" model if initial captures (releases) only occur every other occasion, but recaptures (resightings) occurred every occasion, then every other cohort (row) in the PIM would have no data. Those rows (cohorts) could be deleted from the design data and it would not matter if the real parameter was fixed. However, for the case of a Jolly-Seber type model (eg POPAN or Pradel models) in which the likelihood includes a probability for the leading zeroes and first 1 in a capture history (a likelihood component for the first capture of unmaked animals), and groups represent cohorts that enter through time, you must fix the real parameters for the unused portion of the PIM (ie for occasions prior to time of birth for the cohort) such that the estimated probability of observing the structural 0 is 1. This is easily done by setting the pent (probability of entry) to 0 or by setting the probability of capture to 0 or both. In that case if default.fixed=F, the probabilities for all those parameters would be incorrectly set to 0.5 for p and something non-zero but not predetermined for pent because of the multinomial logit. Now it may be possible that the model would correctly estimate these as 0 if the real parameters were kept in the design, but we know what those values are in that case and they need not be estimated. If it is acceptable to set default.fixed=F, the functions such as summary.mark recognize the non-estimated real parameters and they are not shown in the summaries because in essence they do not exist. If default.fixed=T the parameters are displayed with their fixed value and for summary.mark(mymodel,se=TRUE), they are listed as "Fixed".

Missing design data does have implications for specifying formula but only when interactions are specified. With missing design data various factors may not be fully crossed. For example, with 2 factors A and B, each with 2 levels, the data are fully crossed if there are data with values A1&B1, A1&B2, A2&B1 and A2&B2. If data exist for each of the 4 combinations then you can described the interaction model as ~A*B and it will estimate 4 values. However, if data are missing for one of more of the 4 cells then the "interaction" formula should be specified as ~-1+A:B or ~-1+B:A or ~-1+A the combinations with data. An example of this could be a marking program with multiple sites which resighted at all occasions but which only marked at sites on alternating occasions. In that case time is nested within site and time-site interaction models would be specified as ~-1+time:site.

The argument title supplies a character string that is used to label the output. The argument model.name is a descriptor for the model used in the analysis. The code constructs a model name from the formula specified in the call (e.g., Phi(~1)p(~time)) but on occasion the name may be too long or verbose, so it can be over-ridden with the model.name argument.

The final argument initial can be used to provide initial estimates for the beta parameters. It is either 1) a single starting value for each parameter, 2) an unnamed vector of values (one for each parameter), 3) named vector of values, or 4) the name of mark object that has already been run. For cases 3 and 4, the code only uses appropriate initial beta estimates in which the column names of the design matrix (for model) or vector names match the column names in the design matrix of the model to be run. Any remaining beta parameters without an initial value specified are assigned a 0 initial value. If case 4 is used the models must have the same number of rows in the design matrix and thus presumably the same structure. As long as the vector elements are named (#3), the length of the initial vector no longer needs to match the number of parameters in the new model as long as the elements are named. The names can be retrieved either from the column names of the design matrix or from rownames(x$results$beta) where x is the name of the mark object.

options is a character string that is tacked onto the Proc Estimate statement for the MARK .inp file. It can be used to request options such as NoStandDM (to not standardize the design matrix) or SIMANNEAL (to request use of the simulated annealing optimization method) or any existing or new options that can be set on the estimate proc.

Value

model: a MARK object except for the elements output and results. See mark for a detailed description of the list contents.

Author(s)

Jeff Laake

See Also

process.data,make.design.data, run.mark.model mark

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
# This example is excluded from testing to reduce package check time
data(dipper)
#
# Process data
#
dipper.processed=process.data(dipper,groups=("sex"))
#
# Create default design data
#
dipper.ddl=make.design.data(dipper.processed)
#
# Add Flood covariates for Phi and p that have different values
#
dipper.ddl$Phi$Flood=0
dipper.ddl$Phi$Flood[dipper.ddl$Phi$time==2 | dipper.ddl$Phi$time==3]=1
dipper.ddl$p$Flood=0
dipper.ddl$p$Flood[dipper.ddl$p$time==3]=1
#
#  Define range of models for Phi
#
Phidot=list(formula=~1)
Phitime=list(formula=~time)
PhiFlood=list(formula=~Flood)
#
#  Define range of models for p
#
pdot=list(formula=~1)
ptime=list(formula=~time)
#
# Make assortment of models
#
dipper.phidot.pdot=make.mark.model(dipper.processed,dipper.ddl,
  parameters=list(Phi=Phidot,p=pdot))
dipper.phidot.ptime=make.mark.model(dipper.processed,dipper.ddl,
  parameters=list(Phi=Phidot,p=ptime))
dipper.phiFlood.pdot=make.mark.model(dipper.processed,dipper.ddl,
  parameters=list(Phi=PhiFlood, p=pdot))

RMark documentation built on Nov. 7, 2019, 1:06 a.m.

Related to make.mark.model in RMark...