preprocess.illumina.idat: Preprocess Illumina gene expression iDAT files.

Description Usage Arguments Value Reading iDAT files Array manifest files Probe naming Probe collapsing Background correct Memory usage TODO Author(s) References Examples

Description

This function can decrypt Illumina gene expression iDAT files (aka version 1 iDAT files). It will create Sample Probe Profile.txt and Control Probe Profile.txt files, similar to Illumina GenomeStudio version 1.8.0 [1]. We have made every effort to reproduce the GenomeStudio output, down to the Detection P-Value calculation, the order of the rows, the background correction procedure (which is only applied to gene-probes), the output file headers sample and ProbeID naming, however we cannot guarantee that the output will be identical to that produced by GenomeStudio.

Usage

1
2
3
4
5
6
  preprocess.illumina.idat(files = NULL, path = NULL,
    zipfile = NULL, manifestfile = NULL, clmfile = NULL,
    probeID = c("ProbeID", "ArrayAddressID", "Sequence", "NuID"),
    collapseMode = c("none", "max", "median", "mean"),
    backgroundCorrect = FALSE, outdir = NULL,
    prefix = NULL, verbose = FALSE, memory = "-Xmx1024m")

Arguments

files

A character vector of at least one file name

path

The path to the directories where the files are. This defaults to the current working directory.

zipfile

the path to a zip file containing idat files. This over-rides the files argument.

manifestfile

The full path to the Array manifest file in TXT format. See list_illumina_manifest_files and then download_illumina_manifest_file to download manifest files directly from Illumina.

clmfile

NULL, or the path to a GenePattern-CLM file [5]. This represents a mechanism for renaming, and reordering the samples in the resulting object. Column 1 is the idat name (not path), Column 2 is the desired sample name, column 3 is the biological class of each sample. The order of the rows specifies the order of the samples in the result.

probeID

This controls which value to identify each probe by. Allowable values are “ArrayAddressID”, “ProbeID”, “Sequence”, “NuID”.

collapseMode

Collapse probes to genes using the specified mode. Valid values are “none” (the default), “max”, “median” and “mean” (the GenomeStudio default).

backgroundCorrect

logical, if TRUE then peform background correction on only the gene-level probes; if FALSE, do no correction.

outdir

The full path to the output directory. If NULL the current working directory is used.

prefix

An character[1] used as a file name prefix for the files that will be created. “NONE” (the default) means no prefix will be used.

verbose

logical, if TRUE, print informative messages

memory

The maximum Java memory heapsize. eg: "-Xmx1024m", or "-Xmx4g", which reserves 1GB or 4G of RAM, respectively. default="-Xmx1024m"

Value

If collapseMode=“NONE”, then invisbly return a character[2] containing the file paths of the “Sample Probe Profile.txt” and “Control Probe Profile.txt” files. Otherwise, invisbly return a character[2] containing the file paths of the “Sample Gene Profile.txt” and “Control Gene Profile.txt” files.

Reading iDAT files

iDAT files can be specified in 3 main ways, either as a character vector of file paths, via the files parameter; (2) a character vector of filenames, and the path to the directy containing the files via files and path arguments; (3) via a single zip file containing iDAT files, via the zipfile argument. The zip file may contain subfolders, though each array should have a unique name, regardless of where it sits in the zipfile structure.

Array manifest files

Each array version and revision has a specific manifest file; these are required for decoding idat files, and can be queried, or downloaded via list_illumina_manifest_files, and download_illumina_manifest_file, respectively. Failing that, you can download them manually directly from Illumina [2,3]. It is important to use the TXT version, not the BGX version. You can use a newer Release of the manifest file as long as you get the right array type and Version. For example, HumanHT-12_V3 arrays can use the HumanHT-12_V3_0_R2_11283641_A.txt or HumanHT-12_V3_0_R3_11283641_A.txt manifest files.

Probe naming

There is some confusion about the naming of Illumina Probes. From the array manifest files, you can uniquely identify each probe by either the Illumina Probe ID (eg ILMN_1802252), the Array Address ID (eg 2640048), the Probe_Sequence, or the NuID [3]. Illumina's GenomeStudio uses the ArrayAddressID in the ProbeID column; I think the Probe_ID or NuID are better alternatives. If you choose NuID, note that unlike addNuID2lumi, which uses pre-built mapping tables, we calculate NuID's directly from the probe sequences within the manifest file.

Probe collapsing

Some genes are represented by multiple probes. You can either obtain values at the probe-level, or the gene-level using the ‘collapseMode’ parameter:

  1. “none”The default: do no collapsing. Produces Sample Probe Profile.txt and Control Probe Profile.txt files

  2. “max”Choose the probe with the highest AVG_Intensity value

  3. “median”Choose the probe with the median AVG_Intensity value

  4. “mean”Determine the mean fluorescence across all probes

Choosing either “max”, “median” or “mean” creates “Sample Gene Profile.txt” and “Control Gene Profile.txt” files. “mean” is the method that GenomeStudio uses, however we were unable to reproduce the mean values for “numBeads”, “Detection PValue” and “Probe_STDERR” - we opted for taking the mean of these values. When using “median” and there are an even number of probes for a gene, there is no single median probe, so from the 2 median probes, we take the conservative option of choosing the average AVG_Intensity, the smallest NumBeads, and the largest Probe_STDERR and Detection PVal, for those 2 median probes

Background correct

Illumina GenomeStudio offers a background correction option. This estimates the background, and subtracts it from only the gene-level probes; ie control probes are never background corrected. The value is the mean AVG_Signal level of all the negative control probes on each array.

Memory usage

At the heart of this function is a Java program, which requires that you specify an appropriate maximum amount of memory. The default is -Xmx1024m, which reserves 1 GB RAM. We have analysed 85 Human HT12 arrays using -Xmx2048m. If you get the following error: ‘Exception in thread "main" java.lang.OutOfMemoryError: Java heap space’ Then you need to increase the amount of RAM, upto the maximum available in your system. If you still get the error, then you need a 64-bit system with lots of RAM.
As a rough guide, on my machine, for Human HT12 arrays, the amount of RAM required is approximately 333MB RAM + 19MB per array.

TODO

Author(s)

Mark Cowley [email protected], with contributions from Mark Pinese, David Eby.

References

[1] Genome Studio Gene Expression Module User Guide, version 1.0, Illumina
[2] http://www.switchtoi.com/annotationfiles.ilmn
[3] http://www.switchtoi.com/annotationprevfiles.ilmn
[4] Du, P., Kibbe, W. A., & Lin, S. M. (2007). nuID: a universal naming scheme of oligonucleotides for illumina, affymetrix, and other microarrays Biol Direct, 2, 16. (doi:10.1186/1745-6150-2-16)
[5] http://www.broadinstitute.org/cancer/software/genepattern/tutorial/gp_fileformats.html#CLM

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
# iDAT files+path as input
path <- system.file("extdata", package="lumidat")
outdir <- tempdir()
files <- c("5356583020_A_Grn.idat", "5356583020_B_Grn.idat")
manifestfile <- system.file("extdata", "HumanHT-12_V3_0_R1_99999999.txt", package="lumidat")
files <- preprocess.illumina.idat(files, path, manifestfile=manifestfile, probeID="NuID", outdir=outdir)
files

# zipfile as input
path <- system.file("extdata", package="lumidat")
files <- c("5356583020_A_Grn.idat", "5356583020_B_Grn.idat")
zipfile <- tempfile(fileext=".zip")
zip(zipfile, file.path(path, files), flags="-r9Xq")
files <- preprocess.illumina.idat(zipfile=zipfile, manifestfile=manifestfile, probeID="NuID", outdir=outdir)
files

# CLM file
path <- system.file("extdata", package="lumidat")
outdir <- tempdir()
files <- c("5356583020_A_Grn.idat", "5356583020_B_Grn.idat")
manifestfile <- system.file("extdata", "HumanHT-12_V3_0_R1_99999999.txt", package="lumidat")
clmfile <- system.file("extdata", "5356583020.clm", package="lumidat")
files <- preprocess.illumina.idat(files, path, manifestfile=manifestfile, clmfile=clmfile, probeID="NuID", outdir=outdir)
readLines(files[2], 2)
# [1] "TargetID\tProbeID\tA.AVG_Signal\tA.BEAD_STDERR\tA.Avg_NBEADS\tA.Detection Pval\tB.AVG_Signal\tB.BEAD_STDERR\tB.Avg_NBEADS\tB.Detection Pval"
# [2] "biotin\t0gwIEN5441dbo3j27A\t16902.39\t5062.077\t85\t1.0000000\t16668.27\t6532.465\t60\t1.0000000"

## Not run: 
# Get the Human HT12 V4 manifest file:
manifestfile <- download_illumina_manifest_file("HumanHT-12_V4_0_R2_15002873_B", "txt")

## End(Not run)

drmjc/lumidat documentation built on May 15, 2019, 2:23 p.m.