View source: R/make.mark.model.R
make.mark.model | R Documentation |
Creates a MARK model object that contains a MARK input file with PIMS and design matrix specific to the data and model structure and formulae specified for the model parameters.
make.mark.model( data, ddl, parameters = list(), title = "", model.name = NULL, initial = NULL, call = NULL, default.fixed = TRUE, options = NULL, profile.int = FALSE, chat = NULL, simplify = TRUE, input.links = NULL, parm.specific = FALSE, mlogit0 = TRUE, hessian = FALSE, accumulate = TRUE, icvalues = NULL, wrap = TRUE, nodes = 101, useddl = FALSE, check.model = FALSE )
data |
Data list resulting from function |
ddl |
Design data list from function |
parameters |
List of parameter formula specifications |
title |
Title for the analysis (optional) |
model.name |
Model name to override default name (optional) |
initial |
Vector of named or unnamed initial values for beta parameters or previously run model (optional) |
call |
Pass function call when this function is called from another
function (e.g. |
default.fixed |
if TRUE, real parameters for which the design data have been deleted are fixed to default values |
options |
character string of options for Proc Estimate statement in MARK .inp file |
profile.int |
if TRUE, requests MARK to compute profile intervals |
chat |
pre-specified value for chat used by MARK in calculations of model output |
simplify |
if FALSE, does not simplify PIM structure |
input.links |
specifies set of link functions for parameters with non-simplified structure |
parm.specific |
if TRUE, forces a link to be specified for each parameter |
mlogit0 |
if TRUE, any real parameter that is fixed to 0 and has an mlogit link will have its link changed to logit so it can be simplified |
hessian |
if TRUE specifies to MARK to use hessian rather than second partial matrix |
accumulate |
if TRUE accumulate like data values into frequencies |
icvalues |
numeric vector of individual covariate values for computation of real values |
wrap |
if TRUE, data lines are wrapped to be length 80; if length of a row is not a problem set to FALSE and it will run faster |
nodes |
number of integration nodes for individual random effects (min 15, max 505, default 101) |
useddl |
If TRUE and there are no missing rows or parameters (deleted) then it will use ddl in place of full.ddl that is created internally. |
check.model |
if TRUE, code does an internal consistency check between PIMs and design data when making model. |
This function is called by mark
to create the model but it can
be called directly to create but not run the model. All of the arguments
have default values except for the first 2 that specify the processed data
list (data
) and the design data list (ddl
). If only these 2
arguments are specified default models are used for the parameters. For
example, following with the example from process.data
and
make.design.data
, the default model can be created with:
mymodel=make.mark.model(proc.example.data,ddl)
The call to make.mark.model
creates a model object but does not do
the analysis. The function returns a list that includes several fields
including a design matrix and the MARK input file that will be used with
MARK.EXE
to run the analysis from function
run.mark.model
. The following shows the names of the list
elements in mymodel:
names(mymodel) [1] "data" "model" "title" "model.name" "links" [6] "mixtures" "call" "parameters" "input" "number.of.groups" [11] "group.labels" "nocc" "begin.time" "covariates" "fixed" [16] "design.matrix" "pims" "design.data" "strata.labels" "mlogit.list" [21] "simplify"
The list is defined to be a mark object which is done by assigning a class vector to the list. The classes for an R object can be viewed with the class function as shown below:
class(mymodel) [1] "mark" "CJS"
Each MARK model has 2 class
values. The first identifies it as a mark object and the second identifies
the type of mark analysis, which is the default "CJS" (recaptures only) in
this case. The use of the class feature has advantages in using generic
functions and identifying types of objects. An object of class mark
is defined in more detail in function mark
.
To fit non-trivial models it is necessary to understand the remaining
calling arguments of make.mark.model
and R formula notation. The
model formulae are specified with the calling argument parameters
.
It uses a similar list structure as the parameters
argument in
make.design.data
. It expects to get a list with elements named
to match the parameters in the particular analysis (e.g., Phi and p in CJS)
and each list element is a list, so it is a list of lists). For each
parameter, the possible list elements are formula, link, fixed,
component, component.name, remove.intercept
. In addition, for closed
capture models and robust design model, the element share
is included
in the list for p (capture probabilities) and GammaDoublePrime
(respectively) to indicate whether the model is shared (share=TRUE) or
not-shared (the default) (share=FALSE) with c (recapture probabilities) and
GammaPrime respectively.
formula
specifies the model for the parameter using R formula
notation. An R formula is denoted with a ~
followed by variables in
an equation format possibly using the +
, *
, and :
operators. For example, ~sex+age
is an additive model with the main
effects of sex
and age
. Whereas, ~sex*age
includes the
main effects and the interaction and it is equivalent to the formula
specified as ~sex+age+sex:age
where sex:age
is the interaction
term. The model ~age+sex:age
is slightly different in that the main
effect for sex
is dropped which means that intercept of the
age
effect is common among the sexes but the age pattern could vary
between the sexes. The model ~sex*Age
which is equivalent to
~sex + Age + sex:Age
has only 4 parameters and specifies a linear
trend with age and both the intercept and slope vary by sex. One additional
operator that can be useful is I()
which allows computation of
variables on the fly. For example, the addition of the Agesq variable in
the design data (as described above) can be avoided by using the notation
~Age + I(Age^2)
which specifies use of a linear and quadratic effect
for age. Note that specifying the model ~age + I(age^2)
would be
incorrect and would create an error because age
is a factor variable
whereas Age
is not.
As an example, consider developing a model in which Phi varies by age and p follows a linear trend over time. This model could be specified and run as follows:
p.Time=list(formula=~Time) Phi.age=list(formula=~age) Model.p.Time.Phi.age=make.mark.model(proc.example.data,ddl, parameters=list(Phi=Phi.age,p=p.Time)) Model.p.Time.Phi.age=run.mark.model(Model.p.Time.Phi.age)
The first 2 commands define the p and Phi models that are used in the
parameter
list in the call to make.mark.model
. This is a good
approach for defining models because it clearly documents the models, the
definitions can then be used in many calls to make.mark.model
and it
will allow a variety of models to be developed quickly by creating different
combinations of the parameter models. Using the notation above with the
period separating the parameter name and the description (eg., p.Time) gives
the further advantage that all possible models can be developed quickly with
the functions create.model.list
and
mark.wrapper
.
Model formula can use any field in the design data and any individual
covariates defined in data
. The restrictions on individual
covariates that was present in versions before 1.3 have now been removed.
You can now use interactions of individual covariates with all design data
covariates and products of individual covariates. You can specify
interactions of individual covariates and factor variables in the design
data with the formula notation. For example, ~region*x1
describes a
model in which the slope of x1
varies by region
. Also,
~time*x1
describes a model in which the slope for x1
varied by
time; however, there would be only one value of the covariate per animal so
this is not a time varying covariate model. Models with time varying
covariates are now more easily described with the improvements in version
1.3 but there are restrictions on how the time varying individual covariates
are named. An example would be a trap dependence model in which capture
probability on occasion i+1 depends on whether they were captured on
occasion i. If there are n occasions in a CJS model, the 0/1 (not
caught/caught) for occasions 1 to n-1 would be n-1 individual covariates to
describe recapture probability for occasions 2 to n. For times 2 to n, a
design data field could be defined such that the variable timex is 1 if
time==x and 0 otherwise. The time varying covariates must be named with a
time suffix on the base name of the covariate. In this example they would be
named as x2,. . .,xn
and the model could be specified as ~time
+ x
for time variation and a constant trap effect or as ~time +
time:x
for a trap effect that varied by time. If in the
process.data
call, the argument begin.time
was set to
the year 1990, then the variables would have to be named x1991,x1992,...
because the first recapture occasion would be 1991. Note that the times are
different for different parameters. For example, survival is labeled based
on the beginning of the time interval which would be 1990 so the variables
should be named appropriately for the parameter model in which they will be
used.
In previous versions to handle time varying covariates with a constant effect, it was necessary to use the component feature of the parameter specification to be able to append one or more arbitrary columns to the design matrix. That is no longer required for individual covariates and the component feature was removed in v2.0.8.
There are three other elements of the parameter list that can be useful on
occasion. The first is link
which specifies the link function for
transforming between the beta and real parameters. The possible values are
"logit", "log", "identity" and "mlogit(*)" where * is a numeric identifier.
The "sin" link is not allowed because all models are specified using a
design matrix. The typical default values are assigned to each parameter
(eg "logit" for probabilities, "log" for N, and "mlogit" for pent in POPAN),
so in most cases it will not be necessary to specify a link function.
The second is fixed
which allows real parameters to be set at fixed
values. While this is still allowed, if parameters are to be fixed for all models,
it is best to assign the field fix in the design dataframe and use NA for the values
to be estimated and the real fixed value for those values to be fixed. But if the
values are only to be fixed for a few models then you can still use the fixed approach
described here. The values for fixed
can be either a single value or a list
with 5 alternate forms for ease in specifying the fixed parameters.
Specifying fixed=value
will set all parameters of that type to the
specified value. For example, Phi=list(fixed=1)
will set all Phi to
1. This can be useful for situations like specifying F in the
Burnham/Barker models to all have the same value of 1. Fixed values can
also be specified as a list in which values are specified for certain
indices, times, ages, cohorts, and groups. The first 3 will be the most
useful. The first list format is the most general and flexible but it
requires an understanding of the PIM structure and index numbers for the
real parameters. For example,
Phi=list(formula=~time, fixed=list(index=c(1,4,7),value=1))
specifies Phi varying by time, but the real parameters 1,4,7 are set to 1.
The value
field is either a single constant or its length must match
the number of indices. For example,
Phi=list(formula=~time, fixed=list(index=c(1,4,7),value=c(1,0,1)))
sets real parameters 1 and 7 to 1 and real parameter 4 to 0. Technically,
the index/value format for fixed is not wedded to the parameter type (i.e.,
values for p can be assigned within Phi list), but for the sake of clarity
they should be restricted to fixing real parameters associated with the
particular parameter type. The time
and age
formats for fixed
will probably be the most useful. The format fixed=list(time=x, value=y)
will set all real parameters (of that type) for time x to value y. For
example,
p=list(formula=~time,fixed=list(time=1986,value=1))
sets up time varying capture probability but all values of p for 1986 are set to 1. This can be quite useful to set all values to 0 in years with no sampling (e.g.,
fixed=list(time=c(1982,1984,1986), value=0)
).
The age
, cohort
and group
formats work in a similar
fashion. It is important to recognize that the value you specify for
time
, age
, cohort
and group
must match the
values in the design data list. This is another reason to add binned fields
for age, time etc with add.design.data
after creating the
default design data with make.design.data
. Also note that the
values for time
and cohort
are affected by the
begin.time
argument specified in process.data
. Had I
not specified begin.time=1980
, to set p in the last occasion (1986),
the specification would be
p=list(formula=~time,fixed=list(time=7,value=1))
because begin.time defaults to 1. The advantage of the time-, age-, and
cohort- formats over the index-format is that it will work regardless of the
group definition which can easily be changed by changing the groups
argument in process.data
. The index-format will be dependent
on the group structure because the indexing of the PIMS will most likely
change with changes in the group structure.
Starting with version 3.0.0 deleting design data as deacribed in this paragraph is no longer allowed. Instead, add a fix field to the design data and set the fixed values as described above.
#### No longer supported###
Parameters can also be fixed at default values by deleting the specific rows
of the design data. See make.design.data
and material below.
The default value for fixing parameters for deleted design data can be
changed with the default=value
in the parameter list.
Another useful element of the parameter list is the
remove.intercept
argument. It is set to TRUE to forcefully remove
the intercept. In R notation this can be done by specifying the formula
notation ~-1+... but in formula with nested interactions of factor variables
and additive factor variables the -1 notation will not remove the intercept.
It will simply adjust the column definitions but will keep the same number
of columns and the model will be over-parameterized. The problem occurs with
nested factor variables like tostratum within stratum for multistrata
designs (see mstrata
). As shown in that example, you can
build a formula -1+stratum:tostratum to have transitions that are
stratum-specific. If however you also want to add a sex effect and you
specify -1+sex+stratum:tostratum it will add 2 columns for sex labelled M
and F when in fact you only want to add one column because the intercept is
already contained within the stratum:tostratum term. The argument
remove.intercept will forcefully remove the intercept but it needs to be
able to find a column with all 1's. For example,
Psi=list(formula=~sex+stratum:tostratum,remove.intercept=TRUE) will work but
Psi=list(formula=~-1+sex+stratum:tostratum,remove.intercept=TRUE) will not
work. Also, the -1 notation should be used when there is not an added
factor variable because
Psi=list(formula=~stratum:tostratum,remove.intercept=TRUE)
will not work because while the stratum:tostratum effectively includes an intercept it is equivalent to using an identity matrix and is not specified as treatment contrast with one of the columns as all 1's.
The final argument of the parameter list is contrast which can be used to change the contrast used for model.matrix. It uses the default if none specified. The form is shown in ?model.matrix.
The argument simplify determines whether the pims are simplified such that only indices for unique and fixed real parameters are used. For example, with an all different PIM structure with CJS with K occasions there are K*(K-1) real parameters for Phi and p. However, if you use simplify=TRUE with the default model of Phi(.)p(.), the pims are re-indexed to be 1 for all the Phi's and 2 for all the p's because there are only 2 unique real parameters for that model. Using simplify can speed analysis markedly and probably should always be used. This was left as an argument only to test that the simplification was working and produced the same likelihood and real parameter estimates with and without simplification. It only adjust the rows of the design matrix and not the columns. There are some restrictions for simplification. Real parameters that are given a fixed value are maintained in the design matrix although it does simplify among the fixed parameters. For example, if there are 50 real parameters all fixed to a value of 1 and 30 all fixed to a value of 0, they are reduced to 2 real parameters fixed to 1 and 0. Also, real parameters such as Psi in Multistrata and pent in POPAN that use multinomial logits are not simplified because they must maintain the structure created by the multinomial logit link. All other parameters in those models are simplified. The only downside of simplification is that the labels for real parameters in the MARK output are unreliable because there is no longer a single label for the real parameter because it may represent many different real parameters in the all-different PIM structure. This is not a problem with the labels in R because the real parameter estimates are translated back to the all-different PIM structure with the proper labels.
Be careful specifying interactions and make sure that all combinations are present when interacting 2 or more factor variables. For example, with 2 factors A and B, each with 2 levels, the data are fully crossed if there are data with values A1&B1, A1&B2, A2&B1 and A2&B2. If data exist for each of the 4 combinations then you can described the interaction model as ~A*B and it will estimate 4 values. However, if data are missing for one of more of the 4 cells then the "interaction" formula should be specified as ~-1+A:B or ~-1+B:A or ~-1+A the combinations with data. An example of this could be a marking program with multiple sites which resighted at all occasions but which only marked at sites on alternating occasions. In that case time is nested within site and time-site interaction models would be specified as ~-1+time:site. Another example is age*time when you only mark young of the year. In that case, the first occasion only contains young, the second occasion newly marked young and age 1 if occasions are a year apart. In that case, using something like above ~-1+time:age is useful.
The argument title
supplies a character string that is used to label
the output. The argument model.name
is a descriptor for the model
used in the analysis. The code constructs a model name from the formula
specified in the call (e.g., Phi(~1)p(~time)
) but on occasion the
name may be too long or verbose, so it can be over-ridden with the
model.name
argument.
The final argument initial
can be used to provide initial estimates
for the beta parameters. It is either 1) a single starting value for each
parameter, 2) an unnamed vector of values (one for each parameter), 3) named
vector of values, or 4) the name of mark
object that has already been
run. For cases 3 and 4, the code only uses appropriate initial beta
estimates in which the column names of the design matrix (for model) or
vector names match the column names in the design matrix of the model to be
run. Any remaining beta parameters without an initial value specified are
assigned a 0 initial value. If case 4 is used the models must have the same
number of rows in the design matrix and thus presumably the same structure.
As long as the vector elements are named (#3), the length of the
initial
vector no longer needs to match the number of parameters in
the new model as long as the elements are named. The names can be retrieved
either from the column names of the design matrix or from
rownames(x$results$beta)
where x
is the name of the
mark
object.
options
is a character string that is tacked onto the Proc
Estimate
statement for the MARK .inp file. It can be used to request
options such as NoStandDM (to not standardize the design matrix) or
SIMANNEAL (to request use of the simulated annealing optimization method) or
any existing or new options that can be set on the estimate proc.
model: a MARK object except for the elements output
and
results
. See mark
for a detailed description of the
list contents.
Jeff Laake
process.data
,make.design.data
,
run.mark.model
mark
# This example is excluded from testing to reduce package check time data(dipper) # # Process data # dipper.processed=process.data(dipper,groups=("sex")) # # Create default design data # dipper.ddl=make.design.data(dipper.processed) # # Add Flood covariates for Phi and p that have different values # dipper.ddl$Phi$Flood=0 dipper.ddl$Phi$Flood[dipper.ddl$Phi$time==2 | dipper.ddl$Phi$time==3]=1 dipper.ddl$p$Flood=0 dipper.ddl$p$Flood[dipper.ddl$p$time==3]=1 # # Define range of models for Phi # Phidot=list(formula=~1) Phitime=list(formula=~time) PhiFlood=list(formula=~Flood) # # Define range of models for p # pdot=list(formula=~1) ptime=list(formula=~time) # # Make assortment of models # dipper.phidot.pdot=make.mark.model(dipper.processed,dipper.ddl, parameters=list(Phi=Phidot,p=pdot)) dipper.phidot.ptime=make.mark.model(dipper.processed,dipper.ddl, parameters=list(Phi=Phidot,p=ptime)) dipper.phiFlood.pdot=make.mark.model(dipper.processed,dipper.ddl, parameters=list(Phi=PhiFlood, p=pdot))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.