run-mcmc: Running Markov Chain Monte Carlo for Parameters of Total...

run.tfr.mcmcR Documentation

Running Markov Chain Monte Carlo for Parameters of Total Fertility Rate in Phase II

Description

Runs (or continues running) MCMCs for simulating the total fertility rate of all countries of the world (phase II), using a Bayesian hierarchical model.

Usage

run.tfr.mcmc(nr.chains = 3, iter = 62000, 
    output.dir = file.path(getwd(), "bayesTFR.output"), 
    thin = 1, replace.output = FALSE, annual = FALSE, uncertainty = FALSE, 
    start.year = 1950, present.year = 2020, wpp.year = 2019, 
    my.tfr.file = NULL, my.locations.file = NULL, my.tfr.raw.file = NULL, 
    use.wpp.data = TRUE, ar.phase2 = FALSE, buffer.size = 100, 
    raw.outliers = c(-2, 1),
    U.c.low = 5.5, U.up = 8.8, U.width = 3,
    mean.eps.tau0 = -0.25, sd.eps.tau0 = 0.4, nu.tau0 = 2, 
    Triangle_c4.low = 1, Triangle_c4.up = 2.5, 
    Triangle_c4.trans.width = 2,
    Triangle4.0 = 0.3, delta4.0 = 0.8, nu4 = 2,
    S.low = 3.5, S.up = 6.5, S.width = 0.5, 
    a.low = 0, a.up = 0.2, a.width = 0.02, 
    b.low = a.low, b.up = a.up, b.width = 0.05, 
    sigma0.low = if (annual) 0.0045 else 0.01, sigma0.up = 0.6,  
    sigma0.width = 0.1, sigma0.min = 0.04, 
    const.low = 0.8, const.up = 2, const.width = 0.3, 
    d.low = 0.05, d.up = 0.5, d.trans.width = 1, 
    chi0 = -1.5, psi0 = 0.6, nu.psi0 = 2, 
    alpha0.p = c(-1, 0.5, 1.5), delta0 = 1, nu.delta0 = 2, 
    dl.p1 = 9, dl.p2 = 9, phase3.parameter=NULL,
    S.ini = NULL, a.ini = NULL, b.ini = NULL, sigma0.ini = NULL, 
    Triangle_c4.ini = NULL, const.ini = NULL, gamma.ini = 1, 
    phase3.starting.values = NULL, proposal_cov_gammas = NULL, 
    iso.unbiased = NULL, covariates = c("source", "method"), cont_covariates = NULL, 
    source.col.name="source", seed = NULL, parallel = FALSE, nr.nodes = nr.chains, 
    save.all.parameters = FALSE, compression.type = 'None',
    auto.conf = list(max.loops = 5, iter = 62000, iter.incr = 10000, 
        nr.chains = 3, thin = 80, burnin = 2000),
    verbose = FALSE, verbose.iter = 10, ...)
		
continue.tfr.mcmc(iter, chain.ids = NULL, 
    output.dir = file.path(getwd(), "bayesTFR.output"), 
    parallel = FALSE, nr.nodes = NULL, auto.conf = NULL,
    verbose = FALSE, verbose.iter = 10, ...)

Arguments

nr.chains

Number of MCMC chains to run.

iter

Number of iterations to run in each chain. In addition to a single value, it can have the value ‘auto’ in which case the function runs for the number of iterations given in the auto.conf list (see below), then checks if the MCMCs converged (using the auto.conf settings). If it did not converge, the procedure is repeated until convergence is reached or the number of repetition exceeded auto.conf$max.loops.

output.dir

Directory which the simulation output should be written into.

thin

Thinning interval between consecutive observations to be stored on disk.

replace.output

If TRUE, existing outputs in output.dir will be replaced by results of this simulation.

annual

If TRUE, the model will be trained based on annual TFR data.

uncertainty

Logical. If TRUE, the model described in Liu and Raftery(2020) which takes into account uncertainty about the past TFR observations is used. It will take the observations from rawTFR or from a file given by my.tfr.raw.file, estimate the distribution of these observations with respect to the true TFR. Then instead of treating the observed data as true data, it assumes the true TFR is unknown and include an extra step for estimating past TFR.

use.wpp.data

Logical indicating if default WPP data should be used, i.e. if my.tfr.file will be matched with the WPP data in terms of time periods and locations. If FALSE, it is assumed that the my.tfr.file contains all locations and time periods to be included in the simulation.

ar.phase2

Logical where TRUE implies that the autoregressive component on the residual (for Phase II) is considered as a global parameter. Only used if annual is TRUE. See details below.

start.year

Start year for using historical data.

present.year

End year for using historical data.

wpp.year

Year for which WPP data is used. The functions loads a package called wppx where x is the wpp.year and uses the tfr* datasets.

my.tfr.file

File name containing user-specified TFR time series for one or more countries. See Details below.

my.locations.file

File name containing user-specified locations. See Details below.

my.tfr.raw.file

File name of the raw TFR, used when uncertainty is TRUE. See details below.

buffer.size

Buffer size (in number of iterations) for keeping data in the memory. The smaller the buffer.size the more often will the process access the hard disk and thus, the slower the run. On the other hand, the smaller the buffer.size the less data will be lost in case of failure.

raw.outliers

Vector of size two giving the maximum annual decrease and increase of raw TFR change, respectively. The default values mean that any raw TFR data that decrease more than 2 or increase more than 1 in one year are considered as outliers.

U.c.low, U.up

Lower and upper bound of the parameter U_c, the start level of the fertility transition. The lower bound is set for each country as the maximum of U.c.low and the minimum of historical TFR for that country.

U.width

Width for slice sampling used when updating the U_c parameter.

mean.eps.tau0, sd.eps.tau0

Mean and standard deviation of the prior distribution of m_{\tau} which is the mean of the distortion terms \epsilon_{c,\tau} in start periods \tau_c.

nu.tau0

The shape parameter of the prior Gamma distribution of 1/s_{\tau}^2 is nu.tau0/2. s_{\tau} is standard deviation of the distortion terms in start periods \tau_c.

Triangle_c4.low, Triangle_c4.up

Lower and upper bound of the \Delta_{c4} parameter.

Triangle_c4.trans.width

Width for slice sampling used when updating the logit-transformed \Delta_{c4} parameter.

Triangle4.0, delta4.0

Mean and standard deviation of the prior distribution of the \Delta_4 parameter which is the hierarchical mean of the logit-transformed \Delta_{c4}.

nu4

The shape parameter of the prior Gamma distribution of 1/\delta_4^2 is nu4/2. \delta_4 is standard deviation of the logit-transformed \Delta_{c4}.

S.low, S.up

Lower and upper bound of the uniform prior distribution of the S parameter which is the TFR at which the distortion has maximum variance.

S.width

Width for slice sampling used when updating the S parameter.

a.low, a.up

Lower and upper bound of the uniform prior distribution of the a parameter which is a coefficient for linear decrease of the TFR for TFR larger than S.

a.width

Width for slice sampling used when updating the a parameter.

b.low, b.up

Lower and upper bound of the uniform prior distribution of the b parameter which is a coefficient for linear decrease of the TFR for TFR smaller than S.

b.width

Width for slice sampling used when updating the b parameter.

sigma0.low, sigma0.up

Lower and upper bound of the uniform prior distribution of the \sigma_0 parameter. \sigma_0^2 is the maximum variance of the distortions at TFR equals S.

sigma0.width

Width for slice sampling used when updating the \sigma_0 parameter.

sigma0.min

Minimum value that \sigma_0 can take.

const.low, const.up

Lower and upper bound of the uniform prior distribution of the c parameter which is the multiplier of standard deviation of the distortions before 1975.

const.width

Width for slice sampling used when updating the c parameter.

d.low, d.up

Lower and upper bound of the parameter d_c, the maximum annual decrement for country c. (Note that in Alkema et al. this parameter is a five-years decrement.)

d.trans.width

Width for slice sampling used when updating the logit-transformed d_c parameter.

chi0, psi0

Prior mean and standard deviation of the \chi parameter which is the hierarchical mean of logit-transformed maximum decline parameter d_c.

nu.psi0

The shape parameter of the prior Gamma distribution of 1/\psi^2 is nu.psi0/2. \psi is the standard devation of logit-transformed maximum decline parameter d_c.

alpha0.p

Vector of prior means of the \alpha_i parameters, i=1,2,3. \alpha_i is the hierarchical mean of \gamma_{ci}.

delta0

Prior standard deviation of the \alpha_i parameters. It is a single value, i.e. the same standard deviation is used for all i.

nu.delta0

The shape parameter of the prior Gamma distribution of 1/\delta_i^2 is nu.delta0/2. \delta_i is the standard deviation of \gamma_{ci}.

dl.p1, dl.p2

Values of the parameters p_1 and p_2 of the double logistic function.

phase3.parameter

When uncertainty=TRUE, we need to combine the MCMC process for Phase II and Phase III together. This parameter is used to provide a list for phase3 initial ranges, such as mu.prior.range. If the input is NULL, the default values will be used.

S.ini

Initial value for the S parameter. It can be a single value or an array of the size nr.chains. By default, if nr.chains is 1, it is the middle point of the interval [S.low, S.up]. Otherwise, it is equally spaced distributed between S.low and S.up.

a.ini

Initial value for the a parameter. It can be a single value or an array of the size nr.chains. By default, if nr.chains is 1, it is the middle point of the interval [a.low, a.up]. Otherwise, it is equally spaced distributed between a.low and a.up.

b.ini

Initial value for the b parameter. It can be a single value or an array of the size nr.chains. By default, if nr.chains is 1, it is the middle point of the interval [b.low, b.up]. Otherwise, it is equally spaced distributed between b.low and b.up.

sigma0.ini

Initial value for the \sigma_0 parameter. It can be a single value or an array of the size nr.chains. By default, if nr.chains is 1, it is the middle point of the interval [sigma0.low, sigma0.up]. Otherwise, it is equally spaced distributed between sigma0.low and sigma0.up.

Triangle_c4.ini

Initial value for the \Delta_{c4} parameter. It can be a single value or an array of the size nr.chains. By default, if nr.chains is 1, it is the middle point of the interval [Triangle_c4.low, Triangle_c4.up]. Otherwise, it is equally spaced distributed between Triangle_c4.low and Triangle_c4.up.

const.ini

Initial value for the c parameter. It can be a single value or an array of the size nr.chains. By default, if nr.chains is 1, it is the middle point of the interval [const.low, const.up]. Otherwise, it is equally spaced distributed between const.low and const.up.

gamma.ini

Initial value for the \gamma_c parameter. The same value is used for all countries.

phase3.starting.values

This parameter is used to provide a list of Phase 3 initial values, such as mu.ini and rho.ini in run.tfr3.mcmc. If the input is NULL, the default values will be used.

proposal_cov_gammas

Proposal for the gamma covariance matrices for each country. It should be a list with two values: values and country_codes. The structure corresponds to the object returned by the function get.cov.gammas. By default the covariance matrices are obtained using data(proposal_cov_gammas_cii). This argument overwrite the defaults for countries contained the argument.

iso.unbiased

Codes of countries for which the vital registration TFR estimates are considered unbiased. Only used if uncertainty = TRUE. See details below.

covariates, cont_covariates

Categorical and continuous features used in estimating bias and standard deviation if uncertainty = TRUE. See details below.

source.col.name

If uncertainty is TRUE this is a column name within the given covariates that determines the data source. It is used if iso.unbiased is given to identify the vital registration records.

seed

Seed of the random number generator. If NULL no seed is set. It can be used to generate reproducible results.

parallel

Logical determining if the simulation should run multiple chains in parallel. If it is TRUE, the package snowFT is required. In such a case further arguments can be passed, see description of ....

nr.nodes

Relevant only if parallel is TRUE. It gives the number of nodes for running the simulation in parallel. By default it equals to the number of chains.

save.all.parameters

If TRUE, the distortion terms \epsilon_{c,t} for all t are stored on disk, otherwise not.

compression.type

One of ‘None’, ‘gz’, ‘xz’, ‘bz’, determining type of a compression of the MCMC files. Important: Do not use this option for a long MCMC simulation as this tends to cause very long run times due to slow reading!

auto.conf

List containing a configuration for an ‘automatic’ run (see description of argument iter). Item iter gives the number of iterations in the first chunk of the MCMC simulation; item iter.incr gives the number of iterations in the following chunks; nr.chains gives the number of chains in all chunks of the MCMC simulation; items thin and burnin are used in the convergence diagnostics following each chunk; max.loops controls the maximum number of chunks. All items must be integer values. This argument is only used if the function argument iter is set to ‘auto’.

verbose

Logical switching log messages on and off.

verbose.iter

Integer determining how often (in number of iterations) log messages are outputted during the estimation.

...

Additional parameters to be passed to the function performParallel, if parallel is TRUE. For example cltype which is ‘SOCK’ by default but can be set to ‘MPI’.

chain.ids

Array of chain identifiers that should be resumed. If it is NULL, all existing chains in output.dir are resumed.

Details

The function run.tfr.mcmc creates an object of class bayesTFR.mcmc.meta and stores it in output.dir. It launches nr.chains MCMCs, either sequentially or in parallel. Parameter traces of each chain are stored as (possibly compressed) ASCII files in a subdirectory of output.dir, called mcx where x is the identifier of that chain. There is one file per parameter, named after the parameter with the suffix “.txt”, possibly followed by a compression suffix if compression.type is given. Country-specific parameters (U, d, \gamma) have the suffix _cy where y is the country code. In addition to the trace files, each mcx directory contains the object bayesTFR.mcmc in binary format. All chain-specific files are written into disk after the first, last and each buffer.size-th iteration.

Using the function continue.tfr.mcmc one can continue simulating an existing MCMCs by iter iterations for either all or selected chains.

The function loads observed data (further denoted as WPP dataset) from the tfr and tfr_supplemental datasets in a wppx package where x is the wpp.year. It is then merged with the include dataset that corresponds to the same wpp.year. The argument my.tfr.file can be used to overwrite those default data. If use.wpp.data is FALSE, it fully replaces the default dataset. Otherwise (by default), such a file can include a subset of countries contained in the WPP dataset, as well as a set of new countries. In the former case, the function replaces the corresponding country data from the WPP dataset by values in this file. Only columns are replaced that match column names of the WPP dataset, and in addition, columns ‘last.observed’ and ‘include_code’ are used, if present. Countries are merged with WPP using the column ‘country_code’. In addition, in order the countries to be included in the simulation, in both cases (whether they are included in the WPP dataset or not), they must be contained in the table of locations (UNlocations). In addition, their corresponding include_code must be set to 2. If the column ‘include_code’ is present in my.tfr.file, its value overwrites the default include code, unless it is -1.

The default UN table of locations mentioned above can be overwritten/extended by using a file passed as the my.locations.file argument. Such a file must have the same structure as the UNlocations dataset. Entries in this file will overwrite corresponding entries in UNlocations matched by the column ‘country_code’. If there is no such entry in the default dataset, it will be appended. This option of appending new locations is especially useful in cases when my.tfr.file contains new countries/regions that are not included in UNlocations. In such a case, one must provide a my.locations.file with a definition of those countries/regions.

For simulation of the hyperparameters of the Bayesian hierarchical model, all countries are used that are included in the WPP dataset, possibly complemented by the my.tfr.file, that have include_code equal to 2. The hyperparameters are used to simulate country-specific parameters, which is done for all countries with include_code equal 1 or 2. The following values of include_code in my.tfr.file are recognized: -1 (do not overwrite the default include code), 0 (ignore), 1 (include in prediction but not estimation), 2 (include in both, estimation and prediction). Thus, the set of countries included in the estimation and prediction can be fully user-specific.

Optionally, my.tfr.file can contain a column called last.observed containing the year of the last observation for each country. In such a case, the code would ignore any data after that time point. Furthermore, the function tfr.predict fills in the missing values using the median of the BHM procedure (stored in tfr_matrix_reconstructed of the bayesTFR.prediction object). For last.observed values that are below a middle year of a time interval [t_i, t_{i+1}] (computed as t_i+3) the last valid data point is the time interval [t_{i-1}, t_i], whereas for values larger equal a middle year, the data point in [t_i, t_{i+1}] is valid.

The package contains a dataset called ‘my_tfr_template’ (in ‘extdata’ directory) which is a template for user-specified my.tfr.file.

The parameter uncertainty is set to control whether past TFR is considered to be precise (FALSE), or need to be estimated from the raw data (TRUE). In the latter case, the raw TFR observations are taken either from the rawTFR dataset (default) or from a file given by the my.tfr.raw.file argument. The Bayesian hierarchical model considers the past TFR as unknown, estimates it and stores in output.dir. Details can be found in Liu and Raftery (2020). The covariates, cont_covariates arguments are for listing categorical and continuous features for estimating bias and standard deviation of past TFR observations. If a country is known to have unbiased vital registration (VR) records, one can include it in the iso.unbiased argument as those countries will estimate their past VR records to have 0 bias and 0.0161 standard deviation. The VR records are identified as having “VR” in the column given by source.col.name (“source” by default).

If annual=TRUE, which implies using annual data for training the model, the parameter ar.phase2 will be activated. If ar.phase2 is set to TRUE, then the model of Phase II will change from d_{c,t} = g_{c,t} + \epsilon_{c,t} to d_{c,t}-g_{c,t} = \phi(d_{c,t-1}-g_{c,t-1}) + \epsilon_{c,t}. \phi is considered as country-independent and is called rho_phase2.

Furthermore, if annual is TRUE and my.tfr.file is given, the data in the file must be on annual basis and no matching with the WPP dataset takes place.

Value

An object of class bayesTFR.mcmc.set which is a list with two components:

meta

An object of class bayesTFR.mcmc.meta.

mcmc.list

A list of objects of class bayesTFR.mcmc, one for each MCMC.

Author(s)

Hana Sevcikova, Leontine Alkema, Peiran Liu

References

Peiran Liu, Hana Sevcikova, Adrian E. Raftery (2023): Probabilistic Estimation and Projection of the Annual Total Fertility Rate Accounting for Past Uncertainty: A Major Update of the bayesTFR R Package. Journal of Statistical Software, 106(8), 1-36. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.18637/jss.v106.i08")}.

L. Alkema, A. E. Raftery, P. Gerland, S. J. Clark, F. Pelletier, Buettner, T., Heilig, G.K. (2011). Probabilistic Projections of the Total Fertility Rate for All Countries. Demography, Vol. 48, 815-839. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1007/s13524-011-0040-5")}.

P. Liu, and A. E. Raftery (2020). Accounting for Uncertainty About Past Values In Probabilistic Projections of the Total Fertility Rate for All Countries. Annals of Applied Statistics, Vol 14, no. 2, 685-705. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1214/19-AOAS1294")}.

See Also

get.tfr.mcmc, summary.bayesTFR.mcmc.set.

Examples

## Not run: 
sim.dir <- tempfile()
m <- run.tfr.mcmc(nr.chains = 1, iter = 5, output.dir = sim.dir, verbose = TRUE)
summary(m)
m <- continue.tfr.mcmc(iter = 5, verbose = TRUE)
summary(m)
unlink(sim.dir, recursive = TRUE)
## End(Not run)

PPgp/bayesTFR documentation built on Nov. 5, 2024, 1:57 p.m.