Create and check a GDS or NetCDF file with imputed dosages

Share:

Description

These functions create or check a GDS or NetCDF file and corresponding annotation for imputed dosages from IMPUTE2, BEAGLE, or MaCH.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
imputedDosageFile(input.files, filename, chromosome,
                  input.type=c("IMPUTE2", "BEAGLE", "MaCH"), 
                  input.dosage=FALSE,
                  output.type = c("dosage", "genotype"), 
                  file.type=c("gds", "ncdf"),
                  snp.annot.filename="dosage.snp.RData",
                  scan.annot.filename="dosage.scan.RData",
		  compress="LZMA_RA:1M",
                  compress.annot="LZMA_RA",
                  genotypeDim="scan,snp",
                  scan.df=NULL,
                  snp.exclude=NULL,
                  snp.id.start=1,
                  block.size=5000, 
                  prob.threshold=0.9, verbose=TRUE)
checkImputedDosageFile(genoData, snpAnnot, scanAnnot, 
                       input.files, chromosome,
                       input.type=c("IMPUTE2", "BEAGLE", "MaCH"), 
                       input.dosage=FALSE,
                       output.type=c("dosage", "genotype"),
                       snp.exclude=NULL, snp.id.start=1,
                       tolerance=1e-4, na.logfile=NULL,
                       block.size=5000,
                       prob.threshold=0.9,
                       verbose=TRUE)

Arguments

input.files

A character vector of input files. The first file should always be genotypes (either probabilities or dosages). Files for each input type should be as follows:

  • IMPUTE2: 1) .gens, 2) .samples

  • BEAGLE: 1) .grobs or .dose, 2) .markers

  • MaCH: 1) .mlprob or .mldose, 2) .mlinfo, 3) file with columns named "SNP" and "position" giving base pair position of all SNPs

filename

Character string with name of output GDS or NetCDF file.

chromosome

Chromosome corresponding to the SNPs in the genotype file. Character codes will be mapped to integer values as follows: "X"->23, "XY"->24, "Y"-> 25, "M","MT"->26.

input.type

Format of input files. Accepted file types are "IMPUTE2", "BEAGLE", and "MaCH".

input.dosage

Logical for whether the genotype file (input.files[1]) contains dosages. If FALSE (default), the genotype file is assumed to contain genotype probabilities.

file.type

The type of file to create ("gds" or "ncdf")

snp.annot.filename

Output .RData file for storing a SnpAnnotationDataFrame.

scan.annot.filename

Output .RData file for storing a ScanAnnotationDataFrame.

compress

The compression level for dosage variables in a GDS file (see add.gdsn for options.

compress.annot

The compression level for annotation variables in a GDS file (see add.gdsn for options.

genotypeDim

character string specifying genotype dimensions of output file. Either "snp,scan" or "scan,snp". "scan,snp" is usually much faster to create for GDS files, but "snp,scan" is required for NetCDF files.

scan.df

data frame specifying which samples to include in the output GDS files, with optional scanIDs already assigned. See details.

snp.exclude

vector of integers specifying which SNPs to exclude from the GDS file.

snp.id.start

Starting index for snpID.

block.size

Number of lines to read at once.

verbose

Logical for whether to print progress messages.

genoData

A GenotypeData object from a GDS file created with imputedDosageFile.

snpAnnot

The SnpAnnotationDataFrame created by imputedDosageFile

scanAnnot

The ScanAnnotationDataFrame created by imputedDosageFile

tolerance

tolerance for checking differences against input files

na.logfile

filename for recording snpID and scanID of missing dosages

output.type

output to record in gds file (either "dosage" or "genotype")

prob.threshold

if output.type="genotype", SNP/sample combinations with a maximum probabilityless than prob.threshold will be set to missing

Details

Input files can contain either imputed dosages or genotype probabilities, specified by the input.dosage flag. In either case, the GDS/NetCDF file will store dosage of the A allele in the "genotype" variable. All SNPs are assumed to be on the same chromosome, which is indicated by the chromosome argument.

If the input file contains genotype probabilities for all three genotypes, the dosage is set to missing if the genotype probability strings (before numerical conversion) are equal (e.g., (0,0,0), (0.33, 0.33, 0.33), or (-1, -1, -1)). The dosage is also normalized by the sum of all three genotype probabilities.

The scan.df argument allows the user to specify what samples should be included in the GDS files and an optional sampleID-scanID mapping. scan.df is a data frame with required column sampleID. The function attempts to match the given sampleID in the scan.df data frame with a unique sampleID in the input files. The format of sampleID is different for different input types:

  • IMPUTE2: "ID_1 ID_2" as given in the sample file, where IDs are separated by a space

  • BEAGLE: Column header names corresponding to that sample in .dose or .gprobs file

  • MaCH: The first column of the .mlprob or .mlprob file

The snp.names argument allows the user to specify the which SNPs should be included in the GDS files. However, snp.names must be in the same order as SNPs occur in the imputation files; this option therefore only allows selection of SNPs, not reordering of SNPs. The ordering is checked and an error is thrown if the SNP names are not in order, but due to the design of imputation files, this may not occur until well into the GDS file population. The user can specify the starting snpID by setting snp.id.start, and included SNPs are numbered sequentially starting with snp.id.start. For IMPUTE2 data, snp.names must correspond to the second column of the .gprobs file.

Minimal SNP and scan annotation are created from the input files and stored in RData format in snp.annot.filename and scan.annot.filename.

If requested with na.logfile, checkImputedDosageFile will output a file with scanIDs and snpIDs of missing genotype calls.

Either dosage or genotypes can be output using output.type. If dosage is requested, the dosages will be 2*AAprob + ABprob. If genotype is requested, the value will be set to the genotype with the maximum probability, unless all probabilities are less than prob.threshold. In that case, the genotype will be set to missing. SNPs with max probabilities that are the same for two genotypes (ie, AA=0.5, AB=0.5, BB=0) will also be set to missing.

Currently supported input file types are IMPUTE2, BEAGLE, and MaCH.

Author(s)

Adrienne Stilp, Stephanie Gogarten

References

IMPUTE2: http://mathgen.stats.ox.ac.uk/impute/impute_v2.html

BEAGLE: http://faculty.washington.edu/browning/beagle/beagle.html

MaCH: http://www.sph.umich.edu/csg/abecasis/MACH/tour/imputation.html

See Also

createDataFile, GdsGenotypeReader, NcdfGenotypeReader, GenotypeData, assocRegression

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
gdsfile <- tempfile()
snpfile <- tempfile()
scanfile <- tempfile()
logfile <- tempfile()

# IMPUTE2
probfile <- system.file("extdata", "imputation", "IMPUTE2", "example.chr22.study.gens",
                        package="GWASdata")
sampfile <- system.file("extdata", "imputation", "IMPUTE2", "example.study.samples",
                        package="GWASdata")
imputedDosageFile(input.files=c(probfile, sampfile), filename=gdsfile, chromosome=22,
                  input.type="IMPUTE2", input.dosage=FALSE,
                  snp.annot.filename=snpfile, scan.annot.filename=scanfile)

gds <- GdsGenotypeReader(gdsfile)
scanAnnot <- getobj(scanfile)
snpAnnot <- getobj(snpfile)
genoData <- GenotypeData(gds, scanAnnot=scanAnnot, snpAnnot=snpAnnot)

checkImputedDosageFile(genoData, snpAnnot, scanAnnot,
                      input.files=c(probfile, sampfile), chromosome=22,
                      input.type="IMPUTE2", input.dosage=FALSE, na.logfile=logfile)

geno <- getGenotype(genoData)
getAlleleA(genoData)
getAlleleB(genoData)

log <- read.table(logfile)
head(log)

# association test with imputed dosages
scanAnnot$status <- sample(0:1, nrow(scanAnnot), replace=TRUE)
genoData <- GenotypeData(gds, scanAnnot=scanAnnot, snpAnnot=snpAnnot)
assoc <- assocRegression(genoData, outcome="status", model.type="logistic")
head(assoc)
close(genoData)


# BEAGLE - genotype probabilities
probfile <- system.file("extdata", "imputation", "BEAGLE", "example.hapmap.unphased.bgl.gprobs",
                      package="GWASdata")
markfile <- system.file("extdata", "imputation", "BEAGLE", "hapmap.markers",
                    package="GWASdata")
imputedDosageFile(input.files=c(probfile, markfile), filename=gdsfile, chromosome=22,
                  input.type="BEAGLE", input.dosage=FALSE, file.type="gds",
                  snp.annot.filename=snpfile, scan.annot.filename=scanfile)

# BEAGLE - dosage
dosefile <- system.file("extdata", "imputation", "BEAGLE", "example.hapmap.unphased.bgl.dose",
                    package="GWASdata")
imputedDosageFile(input.files=c(dosefile, markfile), filename=gdsfile, chromosome=22,
                  input.type="BEAGLE", input.dosage=TRUE, file.type="gds",
                  snp.annot.filename=snpfile, scan.annot.filename=scanfile)


# MaCH - genotype probabilities
probfile <- system.file("extdata", "imputation", "MaCH", "mach1.out.mlprob",
                        package="GWASdata")
markfile <- system.file("extdata", "imputation", "MaCH", "mach1.out.mlinfo",
                        package="GWASdata")
posfile <- system.file("extdata", "imputation", "MaCH", "mach1.snp.position",
                        package="GWASdata")
imputedDosageFile(input.files=c(probfile, markfile, posfile), filename=gdsfile, chromosome=22,
                  input.type="MaCH", input.dosage=FALSE, file.type="gds",
                  snp.annot.filename=snpfile, scan.annot.filename=scanfile)

# MaCH - dosage
dosefile <- system.file("extdata", "imputation", "MaCH", "mach1.out.mldose",
                        package="GWASdata")
imputedDosageFile(input.files=c(dosefile, markfile, posfile), filename=gdsfile, chromosome=22,
                  input.type="MaCH", input.dosage=TRUE,  file.type="gds",
                  snp.annot.filename=snpfile, scan.annot.filename=scanfile)

unlink(c(gdsfile, snpfile, scanfile))

Want to suggest features or report bugs for rdrr.io? Use the GitHub issue tracker.