tune1.subsets: Imputes dense map of SNPs on chromosome regions with MaCH
In genMOSSplus: Application of MOSS algorithm to genome-wide association study (GWAS)


For each specified chromosome and its small regions, runs MaCH1 with hapmap to obtain a denser sampling of SNPs in those regions, and prepares this subset of the data for processing by the MOSS algorithm.

tune1.subsets(dir.dat, dir.ped, dir.ann, dir.pos.snp, dir.pos.ann, 
dir.pos.hap, dir.out, prefix.dat, prefix.ped, prefix.ann, prefix.pos.snp, 
prefix.pos.ann, prefix.pos.hap, key.dat = "", key.ann = "", 
key.pos.ann = "", key.pos.hap = "", ending.dat = ".dat", 
ending.ped = ".ped", ending.ann = ".map", ending.pos.snp = ".snps", 
ending.pos.ann = "annotation.txt", ending.pos.hap = ".hap.gz", 
pos.list.triple, ped.nonsnp = 5, ann.header=FALSE, ann.snpcol=2, 
ann.poscol=4, ann.chrcol=0, pos.ann.header = TRUE, pos.ann.snpcol = 5, 
pos.ann.poscol = 2, pos.hap.nonsnp = 2, out.name.subdir = "seg1", 
out.prefix = "subdata", rsq.thresh = 0.5, num.iters = 2, 
hapmapformat = FALSE, mach.loc = "/software/mach1")

`dir.dat`	The name of the directory where the file listing the SNPs of the dataset can be found.
`dir.ped`	The name of the directory where the file with the data of the dataset can be found.
`dir.ann`	The name of the directory where the SNP position information for the dataset can be found. Note: this file must contain position information about all SNPs that are listed in .dat; all other SNPs will be ignored.
`dir.pos.snp`	The name of the directory where the hapmap SNP list can be found.
`dir.pos.ann`	The name of the directory where the hapmap annotation file containing position information can be found.
`dir.pos.hap`	The name of the directory where the zipped hapmap data can be found.
`dir.out`	The name of the directory in which the output folder should be placed.
`prefix.dat`	The beginning of the file name for dataset's list of SNPs.
`prefix.ped`	The beginning of the file name for dataset's data.
`prefix.ann`	The beginning of the file name for dataset's SNP position information.
`prefix.pos.snp`	The beginning of the file name for hapmap's list of SNPs.
`prefix.pos.ann`	The beginning of the file name for hapmap's SNP position information.
`prefix.pos.hap`	The beginning of the file name for hapmap's data.
`key.dat`	Any keyword in the name of dataset's list of SNPs.
`key.ann`	Any keyword in the name of dataset's SNP position information.
`key.pos.ann`	Any keyword in the name of hapmap's SNP position information.
`key.pos.hap`	Any keyword in the name of hapmap's data.
`ending.dat`	The ending of dataset's list of SNPs filename.
`ending.ped`	The ending of dataset's data filename.
`ending.ann`	The ending of dataset's SNP position information filename.
`ending.pos.snp`	The ending of hapmap's list of SNPs filename.
`ending.pos.ann`	The ending of hapmap's SNP position information filename.
`ending.pos.hap`	The ending of hapmap's data filename.
`pos.list.triple`	A list of chromosomes and position boundaries to be expanded upon. The list should contain information in the order: (chromosome number, start position, end position, chromosome number, start position, end position, etc.). This allows users to specify multiple chromosomes with multiple regions within each chromosome. For example, specifying region of positions 6000-19000 and 111000-222000 in chrom 15, together with positions 55000-77000 in chrom 21, can be listed as: c(15, 6000, 19000, 15, 111000, 222000, 21, 55000, 77000). Note that MaCH will be run on one chromosome at a time, and for all its specified regions.
`ped.nonsnp`	The number of non-SNP leading columns in the dataset's data file. For example, the MaCH input format has 5 such columns and the Plink format has 6.
`pos.ann.header`	Whether or not hapmap's SNP position information file has a header. Ex. .annotation.txt = TRUE, .legend.txt = TRUE. Since the format of the hapmap file is not hard-coded, specify the format of your preferred hapmap library; the defaults are set to the 1000 Genomes data (from MaCH website).
`pos.ann.snpcol`	The column number in hapmap's SNP position information file that contains SNP names/ids. For example in .annotation.txt it's column 5; in .legend.txt it's column 1.
`pos.ann.poscol`	The column number in hapmap's SNP position information file that contains position information. For example in .annotation.txt it's column 2; in .legend.txt it's also column 2.
`ann.header`	Whether or not the dataset's SNP position information file has a header. Ex. .map = FALSE, but other formats might have a header. Since the format of this file is not hard-coded, specify the format that your dataset comes with.
`ann.snpcol`	The column number in dataset's SNP position information file that contains SNP names/ids. For example in .map it is column 2.
`ann.poscol`	The column number in dataset's SNP position information file that contains position information. For example in .map it is column 4.
`ann.chrcol`	The column number in the dataset's SNP position information file that contains chromosome number information. In .map format there is no such column, since there is a unique file per chromosome, so the default for this parameter is 0. If all position information is included in one single file for all/many chromosomes, specify which column corresponds to the chromosome number.
`pos.hap.nonsnp`	The number of non-SNP leading columns in hapmap's data file. In .hap.gz it is 2.
`out.name.subdir`	The name of subdirectory structure to be created for output for this sequence of chromosomes and positions. Note: this folder name MUST be different for each different set of chromosome and position boundaries triplets.
`out.prefix`	The beginning of output file names.
`rsq.thresh`	Threshold for RSQ of MaCH's imputation. Recommended default is 0.5.
`num.iters`	The number of iterations MaCH should make in its first step to estimate its model parameters.
`hapmapformat`	The type of haplotype data format: 1000G haplotype dataset has .snps file with one column, so `hapmapformat` defaults to FALSE. Another dataset format listing SNPs (.legend.txt) has 4 columns - change `hapmapformat` to TRUE.
`mach.loc`	The location directory where "mach" executable can be found.
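
The layout of `pos.list.triple` can be checked before a run. The following is a minimal sketch in base R, not the package's own code; the names `regions` and `regions.by.chr` are hypothetical. It reshapes the flat vector from the example above into one row per region and groups the regions by chromosome.

```r
# Flat vector in the documented order: chromosome, start, end, repeated.
pos.list.triple <- c(15, 6000, 19000, 15, 111000, 222000, 21, 55000, 77000)

# The length must be a multiple of 3, otherwise a triple is incomplete.
stopifnot(length(pos.list.triple) %% 3 == 0)

# Reshape into one row per region (hypothetical helper table).
regions <- as.data.frame(matrix(pos.list.triple, ncol = 3, byrow = TRUE))
names(regions) <- c("chr", "start", "end")

# Each region should have start <= end.
stopifnot(all(regions$start <= regions$end))

# Group the regions by chromosome, one list element per chromosome.
regions.by.chr <- split(regions[, c("start", "end")], regions$chr)
print(regions.by.chr)
```

Here chromosome 15 ends up with two regions and chromosome 21 with one, which matches the note that MaCH is run on one chromosome at a time for all of its regions.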

The input files for this function are intended to come from different folders of the subdirectory structure used in the preprocessing steps (see pre0.dir.create). The dataset's SNP list (.dat) and data (.ped) are intended to come from d3 (d03_removed), whereas the dataset's position information (.map) can be obtained from the d1 (d01_plink) subdirectory. The hapmap files are huge and can be shared by many datasets, so there is no need to keep a copy of them in the subdirectory structure of each dataset. Note: if the hapmap file that lists the SNPs ALSO lists their positions, simply provide that file (and its column format) to this function twice (as prefix.pos.snp and prefix.pos.ann).

This function starts from the early preprocessing steps, re-runs MaCH with hapmap on the desired regions, combines CASE with CONTROL, and then calls the preprocessing functions in sequence up to pre6.merge.genos. The output is a single file ready to be passed to run1.moss. A new subdirectory structure, similar to the one made by pre0.dir.create, is created within the new directory out.name.subdir.

This function requires two sets of data: the user's dataset and reference haplotypes. Many hapmap libraries are available for download, so this function tries to be as general as possible by letting users describe the column layout of their files. MaCH also needs to understand the given hapmap format. The defaults are set for 1000G Phase I(a) from MaCH's website: http://www.sph.umich.edu/csg/abecasis/MaCH/download/1000G-PhaseI-Interim.html. Note: the data file (.hap.gz) is expected to be zipped, but the .annotation.txt file must be unzipped before calling this function.

The first thing this function does is extract the given position intervals from the user's data files and from the haplotype files. This makes both sets of files small enough that running MaCH is feasible. MaCH is run on the CASE and CONTROL data files separately.
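
The interval extraction step can be illustrated on a toy annotation table. This is a sketch under stated assumptions rather than the function's actual implementation: the table `ann`, the helper `in.any.region`, and the choice to treat region boundaries as inclusive are all hypothetical; only the .map column layout (SNP id in column 2, position in column 4) comes from the argument descriptions.

```r
# Toy annotation table in .map layout: chromosome, SNP id, genetic distance, position.
ann <- data.frame(
  chr  = c(15, 15, 15, 15, 15),
  snp  = c("rs1", "rs2", "rs3", "rs4", "rs5"),
  dist = 0,
  pos  = c(5000, 6000, 12000, 19000, 150000),
  stringsAsFactors = FALSE
)
ann.snpcol <- 2
ann.poscol <- 4

# Regions requested on this chromosome; boundaries are treated as inclusive (assumption).
regions <- data.frame(start = c(6000, 111000), end = c(19000, 222000))

# Hypothetical helper: TRUE for positions that fall inside at least one region.
in.any.region <- function(pos, regions) {
  hits <- vapply(seq_len(nrow(regions)), function(i) {
    pos >= regions$start[i] & pos <= regions$end[i]
  }, logical(length(pos)))
  rowSums(matrix(hits, nrow = length(pos))) > 0
}

keep <- in.any.region(ann[[ann.poscol]], regions)
snps.kept <- ann[[ann.snpcol]][keep]
print(snps.kept)
```

Only the SNP at position 5000 lies outside both regions, so it is the only one dropped from the toy table.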
After MaCH is run with hapmap, most of the predicted SNPs have very low RSQ scores, so out of the thousands of SNPs within the region in the hapmap file, only hundreds are actually reliable. This function prunes all SNPs with an RSQ score lower than rsq.thresh. CASE and CONTROL are then combined based on the remaining SNPs they have in common. Finally, the function runs the two preprocessing functions (pre5.genos2numeric.batch, pre6.merge.genos) to output the final ready-to-use file.
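
The pruning and combining logic can be sketched in base R as follows. The tables `case.info` and `control.info`, their column names, and all RSQ values are made up for illustration and are not read from real MaCH output; only the rule (drop SNPs with RSQ lower than `rsq.thresh`, then keep the SNPs common to CASE and CONTROL) comes from the description above.

```r
rsq.thresh <- 0.5

# Toy per-SNP imputation quality tables for CASE and CONTROL (values made up).
case.info <- data.frame(SNP = c("rs1", "rs2", "rs3", "rs4"),
                        Rsq = c(0.91, 0.12, 0.63, 0.50),
                        stringsAsFactors = FALSE)
control.info <- data.frame(SNP = c("rs1", "rs2", "rs3", "rs4"),
                           Rsq = c(0.88, 0.70, 0.40, 0.75),
                           stringsAsFactors = FALSE)

# Prune SNPs with RSQ lower than the threshold, separately for each group.
case.keep    <- case.info$SNP[case.info$Rsq >= rsq.thresh]
control.keep <- control.info$SNP[control.info$Rsq >= rsq.thresh]

# Combine on the SNPs that survive in both groups.
common.snps <- intersect(case.keep, control.keep)
print(common.snps)
```

In this toy input rs2 fails in CASE and rs3 fails in CONTROL, so only rs1 and rs4 remain for the combined file.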