genotype.Illumina: Preprocessing and genotyping of Illumina Infinium II arrays.

Description Usage Arguments Details Value Author(s) References See Also Examples

View source: R/crlmm-illumina.R

Description

Preprocessing and genotyping of Illumina Infinium II arrays.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
genotype.Illumina(sampleSheet=NULL, arrayNames=NULL, ids=NULL, path=".",
      arrayInfoColNames=list(barcode="SentrixBarcode_A", position="SentrixPosition_A"),
      highDensity=FALSE, sep="_", fileExt=list(green="Grn.idat", red="Red.idat"), XY=NULL, anno, genome, 
      call.method="crlmm", trueCalls=NULL, cdfName, copynumber=TRUE, batch=NULL, saveDate=FALSE, stripNorm=TRUE, 
      useTarget=TRUE, quantile.method="between", nopackage.norm="quantile", mixtureSampleSize=10^5, fitMixture=TRUE,
      eps=0.1, verbose = TRUE, seed = 1, sns, probs = rep(1/3, 3), DF = 6, SNRMin = 5, 
      recallMin = 10, recallRegMin = 1000, gender = NULL, returnParams = TRUE, badSNP = 0.7)

crlmmIllumina(sampleSheet=NULL, arrayNames=NULL, ids=NULL, path=".",
      arrayInfoColNames=list(barcode="SentrixBarcode_A", position="SentrixPosition_A"),
      highDensity=FALSE, sep="_", fileExt=list(green="Grn.idat", red="Red.idat"), XY=NULL, anno, genome, 
      call.method="crlmm", trueCalls=NULL, cdfName, copynumber=TRUE, batch=NULL, saveDate=FALSE, stripNorm=TRUE, 
      useTarget=TRUE, quantile.method="between", nopackage.norm="quantile", mixtureSampleSize=10^5, fitMixture=TRUE,
      eps=0.1, verbose = TRUE, seed = 1, sns, probs = rep(1/3, 3), DF = 6, SNRMin = 5, 
      recallMin = 10, recallRegMin = 1000, gender = NULL, returnParams = TRUE, badSNP = 0.7)

Arguments

sampleSheet

data.frame containing Illumina sample sheet information (for required columns, refer to BeadStudio Genotyping guide - Appendix A).

arrayNames

character vector containing names of arrays to be read in. If NULL, all arrays that can be found in the specified working directory will be read in.

ids

vector containing ids of probes to be read in. If NULL all probes found on the first array are read in.

path

character string specifying the location of files to be read by the function

arrayInfoColNames

(used when sampleSheet is specified) list containing elements 'barcode' which indicates column names in the sampleSheet which contains the arrayNumber/barcode number and 'position' which indicates the strip number. In older style sample sheets, this information is combined (usually in a column named 'SentrixPosition') and this should be specified as list(barcode=NULL, position="SentrixPosition")

highDensity

logical (used when sampleSheet is specified). If TRUE, array extensions '\_A', '\_B' in sampleSheet are replaced with 'R01C01', 'R01C02' etc.

sep

character string specifying separator used in .idat file names.

fileExt

list containing elements 'Green' and 'Red' which specify the .idat file extension for the Cy3 and Cy5 channels.

XY

NChannelSet containing X and Y intensities.

anno

data.frame containing SNP annotation information from manifest and additional columns 'isSnp', 'position', 'chromosome' and 'featureNames'. For use when cdfName='nopackage'

genome

character string specifying which genome is used in annotation

call.method

character string specifying the genotype calling algorithm to use ('crlmm' or 'krlmm').

trueCalls

matrix specifying known Genotype calls(can contain some NAs) for a subset of samples and features (1 - AA, 2 - AB, 3 - BB).

cdfName

annotation package (see also validCdfNames) or 'nopackage' when combined with 'krlmm', an anno data.frame and genome.

copynumber

'logical.' Whether to store copy number intensities with SNP output.

batch

character vector indicating the batch variable. Must be the same length as the number of samples. See details.

saveDate

'logical'. Should the dates from each .idat be saved with sample information?

stripNorm

'logical'. Should the data be strip-level normalized?

useTarget

'logical' (only used when stripNorm=TRUE). Should the reference HapMap intensities be used in strip-level normalization?

quantile.method

character string specifying the quantile normalization method to use ('within' or 'between' channels).

nopackage.norm

character string specifying normalization to be used when cdfName='nopackage'. Options are 'none', 'quantile' (within channel, between array) and 'loess'.

mixtureSampleSize

Sample size to be use when fitting the mixture model.

fitMixture

'logical.' Whether to fit per-array mixture model.

eps

Stop criteria.

verbose

'logical.' Whether to print descriptive messages during processing.

seed

Seed to be used when sampling. Useful for reproducibility

sns

The sample identifiers. If missing, the default sample names are basename(filenames)

probs

'numeric' vector with priors for AA, AB and BB.

DF

'integer' with number of degrees of freedom to use with t-distribution.

SNRMin

'numeric' scalar defining the minimum SNR used to filter out samples.

recallMin

Minimum number of samples for recalibration.

recallRegMin

Minimum number of SNP's for regression.

gender

integer vector ( male = 1, female = 2 ) or missing, with same length as filenames. If missing, the gender is predicted.

returnParams

'logical'. Return recalibrated parameters from crlmm.

badSNP

'numeric'. Threshold to flag as bad SNP (affects batchQC)

Details

genotype.Illumina (or equivalently crlmmIllumina) is a wrapper of the crlmm function for genotyping. Differences include (1) that the copy number probes (if present) are also quantile-normalized and (2) the class of object returned by this function, CNSet, is needed for subsequent copy number estimation. Note that the batch variable (a character string) has no effect on the normalization or genotyping steps. Rather, batch is required in order to initialize a CNSet container with the appropriate dimensions.

The new 'krlmm' option is available for certain chip types. Optional argument trueCalls matrix contains known Genotype calls (1 - AA, 2 - AB, 3 - BB) for a subset of samples and features. This will used to compute KRLMM coefficients by calling vglm function from VGAM package.

The 'krlmm' method makes use of functions provided in parallel package to speed up the process. It by default initialises up to 8 clusters. This is configurable by setting up an option named "krlmm.cores", e.g. options("krlmm.cores" = 16).

In general, a chip specific annotation package is required to use the genotype.Illumina function. If this is not available (newer chip types or custom chips often don't have a chip-specific package available on Bioconductor), consider using cdfName='nopackage' and specifying anno and genome, which runs 'krlmm' on the samples available. Here anno is a data.frame read in from the relevant chip-specific manifest, which must have additional columns 'isSnp' which is a logical that indicates whether a probe is polymorphic or not, 'position', 'chromosome' and 'featureNames' that give the location on the chromosome and SNP name.

Value

A SnpSuperSet instance.

Author(s)

Matt Ritchie, Cynthia Liu, Zhiyin Dai

References

Ritchie ME, Carvalho BS, Hetrick KN, Tavar\'e S, Irizarry RA. R/Bioconductor software for Illumina's Infinium whole-genome genotyping BeadChips. Bioinformatics. 2009 Oct 1;25(19):2621-3.

Liu R, Dai Z, Yeager M, Irizarry RA1, Ritchie ME. KRLMM: an adaptive genotype calling method for common and low frequency variants. BMC Bioinformatics. 2014 May 23;15:158.

Carvalho B, Bengtsson H, Speed TP, Irizarry RA. Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics. 2007 Apr;8(2):485-99. Epub 2006 Dec 22. PMID: 17189563.

Carvalho BS, Louis TA, Irizarry RA. Quantifying uncertainty in genotype calls. Bioinformatics. 2010 Jan 15;26(2):242-9.

See Also

ocSamples, ldOpts

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
## Not run: 
	# example for 'crlmm' option
	library(ff)
	library(crlmm)
	## to enable paralellization, set to TRUE
	if(FALSE){
		library(snow)
		library(doSNOW)
		## with 10 workers
		cl <- makeCluster(10, type="SOCK")
		registerDoSNOW(cl)
	}
	## path to idat files
	datadir <- "/thumper/ctsa/snpmicroarray/illumina/IDATS/370k"
	## read in your samplesheet
	samplesheet = read.csv(file.path(datadir, "HumanHap370Duo_Sample_Map.csv"), header=TRUE, as.is=TRUE)
	samplesheet <- samplesheet[-c(28:46,61:75,78:79), ]
	arrayNames <- file.path(datadir, unique(samplesheet[, "SentrixPosition"]))
	arrayInfo <- list(barcode=NULL, position="SentrixPosition")
	cnSet <- genotype.Illumina(sampleSheet=samplesheet,
				   arrayNames=arrayNames,
				   arrayInfoColNames=arrayInfo,
				   cdfName="human370v1c",
				   batch=rep("1", nrow(samplesheet)))


## End(Not run)
## Not run: 
	# example for 'krlmm' option
	library(crlmm)
	library(ff)
	# line below is an optional step for krlmm to initialise 16 workers 
	# options("krlmm.cores" = 16)
	# read in raw X and Y intensities output by GenomeStudio's GenCall genotyping module
	XY = readGenCallOutput(c("HumanOmni2-5_4v1_FinalReport_83TUSCAN.csv","HumanOmni2-5_4v1_FinalReport_88CHB-JPT.csv"),
				cdfName="humanomni25quadv1b",
				verbose=TRUE)
	krlmmResult = genotype.Illumina(XY=XY, 
		      			cdfName=ThiscdfName, 
					call.method="krlmm", 
					verbose=TRUE)

	# example for 'krlmm' option with known genotype call for some SNPs and samples
	library(VGAM)
	hapmapCalls = load("hapmapCalls.rda")
	# hapmapCalls should have rownames and colnames corresponding to XY featureNames and sampleNames
	krlmmResult = genotype.Illumina(XY=XY,
					cdfName=ThiscdfName, 
					call.method="krlmm", 
					trueCalls=hapmapCalls, 
					verbose=TRUE)		

## End(Not run)

benilton/crlmm documentation built on Dec. 27, 2019, 12:12 a.m.