imputedDosageFile: Create and check a GDS or NetCDF file with imputed dosages
In GWASTools: Tools for Genome Wide Association Studies

Description Usage Arguments Details Author(s) References See Also Examples

These functions create or check a GDS or NetCDF file and corresponding annotation for imputed dosages from IMPUTE2, BEAGLE, or MaCH.

imputedDosageFile(input.files, filename, chromosome,
                  input.type=c("IMPUTE2", "BEAGLE", "MaCH"), 
                  input.dosage=FALSE,
                  output.type = c("dosage", "genotype"), 
                  file.type=c("gds", "ncdf"),
                  snp.annot.filename="dosage.snp.RData",
                  scan.annot.filename="dosage.scan.RData",
		  compress="LZMA_RA:1M",
                  compress.annot="LZMA_RA",
                  genotypeDim="scan,snp",
                  scan.df=NULL,
                  snp.exclude=NULL,
                  snp.id.start=1,
                  block.size=5000, 
                  prob.threshold=0.9, verbose=TRUE)
checkImputedDosageFile(genoData, snpAnnot, scanAnnot, 
                       input.files, chromosome,
                       input.type=c("IMPUTE2", "BEAGLE", "MaCH"), 
                       input.dosage=FALSE,
                       output.type=c("dosage", "genotype"),
                       snp.exclude=NULL, snp.id.start=1,
                       tolerance=1e-4, na.logfile=NULL,
                       block.size=5000,
                       prob.threshold=0.9,
                       verbose=TRUE)

`input.files`	A character vector of input files. The first file should always be genotypes (either probabilities or dosages). Files for each input type should be as follows: IMPUTE2: 1) .gens, 2) .samples BEAGLE: 1) .grobs or .dose, 2) .markers MaCH: 1) .mlprob or .mldose, 2) .mlinfo, 3) file with columns named "SNP" and "position" giving base pair position of all SNPs
`filename`	Character string with name of output GDS or NetCDF file.
`chromosome`	Chromosome corresponding to the SNPs in the genotype file. Character codes will be mapped to integer values as follows: "X"->23, "XY"->24, "Y"-> 25, "M","MT"->26.
`input.type`	Format of input files. Accepted file types are "IMPUTE2", "BEAGLE", and "MaCH".
`input.dosage`	Logical for whether the genotype file (`input.files[1]`) contains dosages. If `FALSE` (default), the genotype file is assumed to contain genotype probabilities.
`file.type`	The type of file to create ("gds" or "ncdf")
`snp.annot.filename`	Output .RData file for storing a `SnpAnnotationDataFrame`.
`scan.annot.filename`	Output .RData file for storing a `ScanAnnotationDataFrame`.
`compress`	The compression level for dosage variables in a GDS file (see `add.gdsn` for options.
`compress.annot`	The compression level for annotation variables in a GDS file (see `add.gdsn` for options.
`genotypeDim`	character string specifying genotype dimensions of output file. Either "snp,scan" or "scan,snp". "scan,snp" is usually much faster to create for GDS files, but "snp,scan" is required for NetCDF files.
`scan.df`	data frame specifying which samples to include in the output GDS files, with optional scanIDs already assigned. See details.
`snp.exclude`	vector of integers specifying which SNPs to exclude from the GDS file.
`snp.id.start`	Starting index for snpID.
`block.size`	Number of lines to read at once.
`verbose`	Logical for whether to print progress messages.
`genoData`	A `GenotypeData` object from a GDS file created with `imputedDosageFile`.
`snpAnnot`	The `SnpAnnotationDataFrame` created by `imputedDosageFile`
`scanAnnot`	The `ScanAnnotationDataFrame` created by `imputedDosageFile`
`tolerance`	tolerance for checking differences against input files
`na.logfile`	filename for recording snpID and scanID of missing dosages
`output.type`	output to record in gds file (either `"dosage"` or `"genotype"`)
`prob.threshold`	if `output.type="genotype"`, SNP/sample combinations with a maximum probabilityless than `prob.threshold` will be set to missing

Input files can contain either imputed dosages or genotype probabilities, specified by the input.dosage flag. In either case, the GDS/NetCDF file will store dosage of the A allele in the "genotype" variable. All SNPs are assumed to be on the same chromosome, which is indicated by the chromosome argument.

If the input file contains genotype probabilities for all three genotypes, the dosage is set to missing if the genotype probability strings (before numerical conversion) are equal (e.g., (0,0,0), (0.33, 0.33, 0.33), or (-1, -1, -1)). The dosage is also normalized by the sum of all three genotype probabilities.

The scan.df argument allows the user to specify what samples should be included in the GDS files and an optional sampleID-scanID mapping. scan.df is a data frame with required column sampleID. The function attempts to match the given sampleID in the scan.df data frame with a unique sampleID in the input files. The format of sampleID is different for different input types:

IMPUTE2: "ID_1 ID_2" as given in the sample file, where IDs are separated by a space
BEAGLE: Column header names corresponding to that sample in .dose or .gprobs file
MaCH: The first column of the .mlprob or .mlprob file

The snp.names argument allows the user to specify the which SNPs should be included in the GDS files. However, snp.names must be in the same order as SNPs occur in the imputation files; this option therefore only allows selection of SNPs, not reordering of SNPs. The ordering is checked and an error is thrown if the SNP names are not in order, but due to the design of imputation files, this may not occur until well into the GDS file population. The user can specify the starting snpID by setting snp.id.start, and included SNPs are numbered sequentially starting with snp.id.start. For IMPUTE2 data, snp.names must correspond to the second column of the .gprobs file.

Minimal SNP and scan annotation are created from the input files and stored in RData format in snp.annot.filename and scan.annot.filename.

If requested with na.logfile, checkImputedDosageFile will output a file with scanIDs and snpIDs of missing genotype calls.

Either dosage or genotypes can be output using output.type. If dosage is requested, the dosages will be 2*AAprob + ABprob. If genotype is requested, the value will be set to the genotype with the maximum probability, unless all probabilities are less than prob.threshold. In that case, the genotype will be set to missing. SNPs with max probabilities that are the same for two genotypes (ie, AA=0.5, AB=0.5, BB=0) will also be set to missing.

Currently supported input file types are IMPUTE2, BEAGLE, and MaCH.

Adrienne Stilp, Stephanie Gogarten

IMPUTE2: http://mathgen.stats.ox.ac.uk/impute/impute_v2.html

BEAGLE: http://faculty.washington.edu/browning/beagle/beagle.html

MaCH: http://www.sph.umich.edu/csg/abecasis/MACH/tour/imputation.html

createDataFile, GdsGenotypeReader, NcdfGenotypeReader, GenotypeData, assocRegression

gdsfile <- tempfile()
snpfile <- tempfile()
scanfile <- tempfile()
logfile <- tempfile()

# IMPUTE2
probfile <- system.file("extdata", "imputation", "IMPUTE2", "example.chr22.study.gens",
                        package="GWASdata")
sampfile <- system.file("extdata", "imputation", "IMPUTE2", "example.study.samples",
                        package="GWASdata")
imputedDosageFile(input.files=c(probfile, sampfile), filename=gdsfile, chromosome=22,
                  input.type="IMPUTE2", input.dosage=FALSE,
                  snp.annot.filename=snpfile, scan.annot.filename=scanfile)

gds <- GdsGenotypeReader(gdsfile)
scanAnnot <- getobj(scanfile)
snpAnnot <- getobj(snpfile)
genoData <- GenotypeData(gds, scanAnnot=scanAnnot, snpAnnot=snpAnnot)

checkImputedDosageFile(genoData, snpAnnot, scanAnnot,
                      input.files=c(probfile, sampfile), chromosome=22,
                      input.type="IMPUTE2", input.dosage=FALSE, na.logfile=logfile)

geno <- getGenotype(genoData)
getAlleleA(genoData)
getAlleleB(genoData)

log <- read.table(logfile)
head(log)

# association test with imputed dosages
scanAnnot$status <- sample(0:1, nrow(scanAnnot), replace=TRUE)
genoData <- GenotypeData(gds, scanAnnot=scanAnnot, snpAnnot=snpAnnot)
assoc <- assocRegression(genoData, outcome="status", model.type="logistic")
head(assoc)
close(genoData)


# BEAGLE - genotype probabilities
probfile <- system.file("extdata", "imputation", "BEAGLE", "example.hapmap.unphased.bgl.gprobs",
                      package="GWASdata")
markfile <- system.file("extdata", "imputation", "BEAGLE", "hapmap.markers",
                    package="GWASdata")
imputedDosageFile(input.files=c(probfile, markfile), filename=gdsfile, chromosome=22,
                  input.type="BEAGLE", input.dosage=FALSE, file.type="gds",
                  snp.annot.filename=snpfile, scan.annot.filename=scanfile)

# BEAGLE - dosage
dosefile <- system.file("extdata", "imputation", "BEAGLE", "example.hapmap.unphased.bgl.dose",
                    package="GWASdata")
imputedDosageFile(input.files=c(dosefile, markfile), filename=gdsfile, chromosome=22,
                  input.type="BEAGLE", input.dosage=TRUE, file.type="gds",
                  snp.annot.filename=snpfile, scan.annot.filename=scanfile)


# MaCH - genotype probabilities
probfile <- system.file("extdata", "imputation", "MaCH", "mach1.out.mlprob",
                        package="GWASdata")
markfile <- system.file("extdata", "imputation", "MaCH", "mach1.out.mlinfo",
                        package="GWASdata")
posfile <- system.file("extdata", "imputation", "MaCH", "mach1.snp.position",
                        package="GWASdata")
imputedDosageFile(input.files=c(probfile, markfile, posfile), filename=gdsfile, chromosome=22,
                  input.type="MaCH", input.dosage=FALSE, file.type="gds",
                  snp.annot.filename=snpfile, scan.annot.filename=scanfile)

# MaCH - dosage
dosefile <- system.file("extdata", "imputation", "MaCH", "mach1.out.mldose",
                        package="GWASdata")
imputedDosageFile(input.files=c(dosefile, markfile, posfile), filename=gdsfile, chromosome=22,
                  input.type="MaCH", input.dosage=TRUE,  file.type="gds",
                  snp.annot.filename=snpfile, scan.annot.filename=scanfile)

unlink(c(gdsfile, snpfile, scanfile))