singleCAM: Continuous Admixture Modeling (CAM) for a Single LD Decay...
In QIU-Hongxiang-David/CAMer: Continuous Admixture Modeler

View source: R/CAMer.R

Find the estimated time intervals/points for the HI, CGF1(-I), CGF2(-I) and GA(-I) models and the corresponding statistics (ssE, msE, etc.) for a single LD decay curve (e.g. Combined_LD or Jack? in a .rawld file).

singleCAM(d, Z, m1, T = 500L, isolation = TRUE, fast.search = TRUE,
  max.duration = 150L, single.parallel = isolation && !fast.search,
  single.clusternum = 4L)

`d`	the numeric vector of genetic distance (Morgan) of LD decay curve
`Z`	the numeric vector of admixture induced LD (ALD) decay curve
`m1`	the admixture proportion of population 1 or the path of the .log file containing this information. If `m2` is the admixing proportion of population 2, then `m1+m2=1`. The .log file should be the output of `MALDmef`.
`T`	the most ancient generation to be searched. Defaults to 500.
`isolation`	`TRUE` if the models used for fitting are HI, CGF1-I, CGF2-I and GA-I; `FALSE` if the models used for fitting are HI, CGF1, CGF2 and GA. Defaults to `TRUE`.
`fast.search`	Defaults to `TRUE`. See "Details".
`max.duration`	Defaults to 150. See "Details".
`single.parallel`	a logical expression indicating whether parallel computation should be used. Defaults to `TRUE` if `isolation=TRUE,fast.search=FALSE` and `FALSE` otherwise.
`single.clusternum`	the number of clusters in parallel computation. Defaults to 4 for the four models. Used if `single.parallel=TRUE`.

fast.search is only used when isolation=TRUE. Set it to TRUE to use the fast searching algorithm, which sometimes gives slightly wider time intervals than the slow searching algorithm. fast.search=TRUE (the default) is recommended, not only because it is significantly faster, but also because in our experience it partially mitigates the over-fitting problem of the CGF-I and GA-I models, so that HI usually does not perform significantly worse than they do.

max.duration is only used when isolation=TRUE and fast.search=FALSE. It is the maximal duration of admixture n that is considered possible. Smaller values make the slow searching algorithm faster. If max.duration>T, it is set to T.

Given a single LD decay curve, for each model, this function does the following:

If isolation=FALSE, it goes through all possible time intervals/points in [0,T], each time estimating θ_0 and θ_1 for the corresponding interval/point, and chooses the time interval/point that achieves the smallest ssE as the estimate for the model. The corresponding θ=(θ_0,θ_1) is the estimated θ for that model.

If isolation=TRUE,fast.search=FALSE, it also goes through all possible time intervals/points to estimate parameters. This slow algorithm is not recommended as it takes more than 40 minutes if T=500L,max.duration=150L and Z has length 3497 without parallel computation.

If isolation=TRUE,fast.search=TRUE, for the CGF1-I, CGF2-I and GA-I models, it uses a fast searching algorithm to search for a local minimum of ssE. Unlike the result of the slow algorithm, this local minimum is not guaranteed to be the global minimum, but it is usually the same or quite close to it. The fast algorithm is recommended because it takes only about 2 minutes if T=500L,max.duration=150L and Z has length 3497 without parallel computation.
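The exhaustive search described above can be illustrated for the simplest case, a single time point as in the HI model. The following is only a minimal sketch, not CAMer's implementation: `fit_hi_grid` is a hypothetical helper, and it assumes the curve has the form Z ≈ θ_0 + θ_1·exp(-t·d) for a time point t, with θ_0 and θ_1 estimated by ordinary least squares.

```r
# Hypothetical helper (not part of CAMer): try every time point t in 1..T,
# fit theta0 and theta1 by least squares, and keep the t with the smallest ssE.
# Assumption: an HI-like curve has the form Z ~ theta0 + theta1 * exp(-t * d).
fit_hi_grid <- function(d, Z, T = 500L) {
  best <- list(ssE = Inf)
  for (t in seq_len(T)) {
    x <- exp(-t * d)                     # decay curve for generation t
    coefs <- unname(coef(lm(Z ~ x)))     # least-squares estimates (theta0, theta1)
    ssE <- sum((Z - coefs[1] - coefs[2] * x)^2)
    if (ssE < best$ssE) {
      best <- list(start = t, theta0 = coefs[1], theta1 = coefs[2], ssE = ssE)
    }
  }
  best
}

# Synthetic, noise-free curve generated at t = 50 (illustrative data only)
d <- seq(0.002, 0.3, by = 0.002)
Z <- 0.01 + 0.2 * exp(-50 * d)
est <- fit_hi_grid(d, Z, T = 100L)
est$start
```

Searching over intervals instead of points follows the same pattern with a second loop over the interval end, which is why the cost grows quickly with T.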

maxindex is the index of Z such that Z[maxindex] is the maximal value of Z. If the first few values of Z are not decreasing as theoretically expected, entries 1:maxindex of Z and d are removed from the calculation and from the returned values.
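As a rough illustration of this truncation, the sketch below uses a hypothetical helper, `truncate_at_max`, which is not part of CAMer. It assumes that nothing is dropped when the curve already starts at its maximum (maxindex equal to 1); that reading of the rule is an assumption.

```r
# Hypothetical helper (not CAMer's code): drop the leading part of the curve
# up to and including its maximum when the curve does not start at its maximum.
truncate_at_max <- function(d, Z) {
  maxindex <- which.max(Z)
  if (maxindex > 1) {                 # assumption: no truncation when maxindex is 1
    keep <- seq_along(Z) > maxindex   # drop entries 1:maxindex
    d <- d[keep]
    Z <- Z[keep]
  }
  list(maxindex = maxindex, d = d, Z = Z)
}

# Illustrative data: the first three values rise instead of decaying
d <- c(0.001, 0.002, 0.003, 0.004, 0.005)
Z <- c(0.10, 0.12, 0.15, 0.11, 0.08)
out <- truncate_at_max(d, Z)
out$maxindex
out$d
```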

If the last entry of the distance vector d is greater than 10, a warning about the unit is given, since d is expected to be in Morgan.

If the estimated time intervals/points cover T, a warning that T is too small is given. The user should re-run the function with a larger T so that the optimal time intervals/points can be reached.

The parallel or snow package must be installed if single.parallel=TRUE. For newer versions of R (>=2.14.0), parallel is in R-core. If only snow is available, it is recommended to load it with library before using the parallel computing functionality. When only snow is available, it will be loaded with require and hence the search path will be changed; if parallel is available, it will be used but the search path will not be changed. One may go to https://cran.r-project.org/src/contrib/Archive/snow/ to download and install older versions of snow if the version of R is too old. If neither package is available but single.parallel=TRUE, the function will compute sequentially with messages.
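The fallback behaviour described above can be sketched in base R. This is an illustration under assumptions rather than the package's actual logic: `run_models` is a hypothetical helper, it only considers the parallel package (not snow), and it also falls back to sequential computation if a cluster cannot be created.

```r
# Hypothetical helper (not part of CAMer): apply f to each element of xs on a
# cluster when one can be created, otherwise compute sequentially with a message.
run_models <- function(xs, f, clusternum = 2L) {
  cl <- NULL
  if (requireNamespace("parallel", quietly = TRUE)) {
    # requireNamespace() does not attach the package, so the search path is unchanged
    cl <- tryCatch(parallel::makeCluster(clusternum), error = function(e) NULL)
  }
  if (is.null(cl)) {
    message("no cluster available; computing sequentially")
    return(lapply(xs, f))
  }
  on.exit(parallel::stopCluster(cl), add = TRUE)  # always release the workers
  parallel::parLapply(cl, xs, f)
}

res <- run_models(1:4, function(x) x^2)
unlist(res)
```

Releasing the cluster in `on.exit` means the workers are stopped even if the computation fails part-way.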

Be aware that when the computational cost is small (e.g. isolation=FALSE or T=20L,isolation=TRUE,fast.search=FALSE,max.duration=10L), using parallel computation can result in longer computation time.

This class has dedicated plot and print methods.

an object of S3 class "CAM.single". A list consisting of:

`call`	the matched call
`maxindex`	the index of the maximal value in `Z`. See "Details".
`d,Z`	identical to function inputs up to some truncation. See "Details".
`T,isolation`	identical to function inputs
`A`	numeric matrix whose (i,j)-th entry is exp(-j·d_i), where d_i is the i-th entry of `d` and j is the generation.
`m1,m2`	admixture proportion of population 1 and 2
`estimate`	a list of estimates. Each element contains the estimated parameters m, n, θ_0, θ_1, starting generation, ending generation and the corresponding ssE and msE. The time point for HI model is stored in `start` variable.
`summary`	a data frame containing the information in `estimate` in a compact form
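For illustration, a matrix with the same entry formula as `A` above can be built in one line of base R. This is a sketch only: `n_gen` is a hypothetical name for the number of generations, and letting j run from 1 to `n_gen` is an assumption about the column indexing.

```r
# Illustrative values only
d <- c(0.01, 0.02, 0.05)   # genetic distances (Morgan)
n_gen <- 4L                # hypothetical name; assumed range of generations j = 1..n_gen

# outer(d, j) has entries d[i] * j, so A[i, j] = exp(-j * d[i])
A <- exp(-outer(d, seq_len(n_gen)))
dim(A)
```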

If the input of m1 is the .log file path, there should not be any "=" in the names of populations. If there are, the function may not be able to execute normally, and the user should check the .log file and input m1 as a number manually.

When LD.parallel=TRUE or single.parallel=TRUE, it is not recommended to terminate the execution of the function. If the parallel package is available, setDefaultCluster from parallel can reportedly be used to remove the registered cluster, but in our experiments this did not work; fortunately, these unused clusters will be removed automatically later, but with warnings. If only the snow package is available, according to http://homepage.stat.uiowa.edu/~luke/R/cluster/cluster.html, "don't interrupt a snow computation". The ultimate method to close the unused clusters is probably to quit the R session.

CAM, reconstruct.fitted, conclude.model

data(CGF_50)
Z<-CGF_50$Combined_LD
d<-CGF_50$Distance

#fit models with isolation=FALSE
fit<-singleCAM(d,Z,m1=0.3,T=10L,isolation=FALSE) #with warning

#re-run with larger T
fit<-singleCAM(d,Z,m1=0.3,T=100L,isolation=FALSE)
fit

#fit models with isolation=TRUE using fast searching algorithm
fit<-singleCAM(d,Z,m1=0.3,T=100L)
fit

#fit models with isolation=TRUE using slow searching algorithm
#with parallel computation
fit<-singleCAM(d,Z,m1=0.3,T=100L,fast.search=FALSE,
               single.parallel=TRUE,single.clusternum=4L)
fit

#fit models with isolation=TRUE using slow searching algorithm
#without parallel computation
fit<-singleCAM(d,Z,m1=0.3,T=70L,fast.search=FALSE,single.parallel=FALSE)
fit

fitted.curves<-reconstruct.fitted(fit)