This part defines read and write maps that can be used to remap cell indices before reading and writing data from and to file, respectively.
This package provides methods to create read and write (cell-index) maps from Affymetrix CDF files. These can be used to store the cell data in an optimal order so that when data is read it is read in contiguous blocks, which is faster.
In addition to this, read maps may also be used to read CEL files that have been "reshuffled" by other software. For instance, the dChip software (http://www.dchip.org/) rotates Affymetrix Exon, Tiling and Mapping 500K data. See the example below for how to read such data "unrotated".
For more details on how cell indices are defined, see 2. Cell coordinates and cell indices.
When reading data from file, it is faster to read the data in the order in which it is stored than in, say, a random order. The main reason is that the read arm of the hard drive has to move more if data is not read consecutively. The same applies when writing data to file. The read and write cache of the file system may compensate somewhat for this, but not completely.
In Affymetrix CEL files, cell data is stored in order of cell indices. Moreover, (except for a few early chip types) Affymetrix randomizes the locations of the cells such that cells in the same unit (probeset) are scattered across the array. Thus, when reading CEL data arranged by units using for instance readCelUnits(), the order of the cells requested is both random and scattered.
Since CEL data is often queried unit by unit (except for some probe-level normalization methods), one can improve the speed of reading data by saving data such that cells in the same unit are stored together. A write map is used to remap cell indices to file indices. When later reading that data back, a read map is used to remap file indices to cell indices. Read and write maps are described next.
Consider cell indices i=1, 2, ..., N*K and file indices j=1, 2, ..., N*K. A read map is then a bijective (one-to-one) function h() such that
i = h(j),
and the corresponding write map is the inverse function h^{-1}() such that
j = h^{-1}(i).
Since the mapping is required to be bijective, it holds that i = h(h^{-1}(i)) and that j = h^{-1}(h(j)). For example, consider the "reversing" read map function h(j)=N*K-j+1. The write map function is h^{-1}(i)=N*K-i+1. To verify the bijective property of this map, we see that h(h^{-1}(i)) = h(N*K-i+1) = N*K-(N*K-i+1)+1 = i as well as h^{-1}(h(j)) = h^{-1}(N*K-j+1) = N*K-(N*K-j+1)+1 = j.
In this package, read and write maps are represented as integer vectors of length N*K with unique elements in {1,2,...,N*K}. Consider cell and file indices as in the previous section. For example, the "reversing" read map in the previous section can be represented as
    readMap <- (N*K):1
Given a vector j of file indices, the cell indices are then obtained as i = readMap[j].
The corresponding write map is
    writeMap <- (N*K):1
and given a vector i of cell indices, the file indices are then obtained as j = writeMap[i].
Note also that the bijective property holds for this mapping, that is, i == readMap[writeMap[i]] and i == writeMap[readMap[i]] are both TRUE.
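As a small runnable check of the statements above, the following sketch builds the "reversing" maps for a toy array and verifies both round trips. The size of six cells is an arbitrary value chosen for illustration, and the variable names other than readMap and writeMap are hypothetical.

```r
# Toy array size (illustrative value only; stands in for N*K)
nbrOfCells <- 6L

# "Reversing" read map: file index j -> cell index i = N*K - j + 1
readMap <- nbrOfCells:1
# The reversing map is its own inverse, so the write map is identical
writeMap <- nbrOfCells:1

i <- 1:nbrOfCells   # all cell indices
j <- 1:nbrOfCells   # all file indices

# Both compositions should return the identity mapping
roundTripCells <- readMap[writeMap[i]]
roundTripFiles <- writeMap[readMap[j]]
print(all(roundTripCells == i))
print(all(roundTripFiles == j))
```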
Because the mapping is bijective, the write map can be calculated from the read map by:
    writeMap <- order(readMap)
and vice versa:
    readMap <- order(writeMap)
Note that the invertMap() method is much faster than order().
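To illustrate why inverting a map does not require sorting, here is a sketch in base R that inverts a random permutation in two ways: with order(), and with a single indexed assignment, which takes linear time. This is only a generic technique for inverting a permutation; it is not claimed to be how invertMap() is implemented, and the variable names are hypothetical.

```r
set.seed(1)
n <- 10L
readMap <- sample(n)   # a random permutation of 1, 2, ..., n

# Inversion by sorting: order(readMap)[k] is the position of value k
writeMapByOrder <- order(readMap)

# Inversion by direct assignment: position readMap[j] receives j
writeMapByAssign <- integer(n)
writeMapByAssign[readMap] <- seq_len(n)

# Both approaches give the same inverse, and composing gives the identity
print(identical(writeMapByOrder, writeMapByAssign))
print(all(readMap[writeMapByAssign] == seq_len(n)))
```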
Since most algorithms for Affymetrix data are based on probeset (unit) models, it is natural to read data unit by unit. Thus, to optimize the speed, cells should be stored in contiguous blocks of units. The method readCdfUnitsWriteMap() can be used to generate a write map from a CDF file such that if the units are read in order, readCelUnits() will read the cell data in order.
Example:
    # Find any CDF file
    cdfFile <- findCdf()

    # Get the order of cell indices
    indices <- readCdfCellIndices(cdfFile)
    indices <- unlist(indices, use.names=FALSE)

    # Get an optimal write map for the CDF file
    writeMap <- readCdfUnitsWriteMap(cdfFile)

    # Get the read map
    readMap <- invertMap(writeMap)

    # Validate correctness
    indices2 <- readMap[indices]  # == 1, 2, 3, ..., N*K
Warning: do not misunderstand this example. It cannot be used to improve the reading speed of default CEL files. For that, the data in the CEL files has to be rearranged (by the corresponding write map).
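What "rearranged by the write map" means can be sketched in base R on a toy example that does not touch any CEL file. The six cells, their values and the two units are all made up for illustration, and the maps follow the definitions given earlier (i = readMap[j] and j = writeMap[i]).

```r
# Toy example: 6 cells belonging to two units whose cells are scattered
cellValues <- c(10, 20, 30, 40, 50, 60)   # value of cell i is 10 * i
unitA <- c(5L, 2L, 4L)                    # cell indices of hypothetical unit A
unitB <- c(1L, 6L, 3L)                    # cell indices of hypothetical unit B

# Read map: file position j holds cell readMap[j], i.e. cells in unit order
readMap <- c(unitA, unitB)
# Write map: cell i is stored at file position writeMap[i]
writeMap <- order(readMap)

# Rearrange the data as it would be laid out in the remapped file
fileData <- numeric(length(cellValues))
fileData[writeMap] <- cellValues

# The cells of each unit now occupy consecutive file positions
positionsA <- writeMap[unitA]
positionsB <- writeMap[unitB]
print(positionsA)
print(positionsB)

# Reading unit A from the rearranged data returns the original cell values
valuesA <- fileData[positionsA]
print(identical(valuesA, cellValues[unitA]))
```

Only after such a rearrangement do unit-by-unit requests translate into contiguous file positions; applying the maps to a file stored in the default cell order changes nothing about where the data sits on disk.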
A CEL file may have been rotated by other software. For example, the dChip software rotates Affymetrix Exon, Tiling and Mapping 500K arrays 90 degrees clockwise, and the data remains rotated when exported as CEL files. To read such data in a non-rotated way, a read map can be used to "unrotate" the data. The 90-degree clockwise rotation that dChip effectively uses to store such data is explained by:
    [code listing not preserved in this copy]
Thus, to read this data "unrotated", use the following read map:
    [code listing not preserved in this copy]
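As a generic illustration of how a 90-degree clockwise rotation can be expressed as a pair of maps, the sketch below works on a toy 2-by-3 grid in base R. It is not the package's own listing and is not claimed to match the exact layout dChip uses: the grid size, the assumption that cell indices are numbered row by row, and all variable names other than readMap and writeMap are hypothetical.

```r
# Assumed toy grid: 2 rows by 3 columns, cell indices numbered row by row
nbrOfRows <- 2L
nbrOfCols <- 3L
cells <- seq_len(nbrOfRows * nbrOfCols)

# Zero-based (row, column) position of each cell in the unrotated grid
row0 <- (cells - 1L) %/% nbrOfCols
col0 <- (cells - 1L) %% nbrOfCols

# After a 90-degree clockwise rotation the grid has nbrOfCols rows and
# nbrOfRows columns; cell (row0, col0) lands at (col0, nbrOfRows - 1 - row0)
writeMap <- col0 * nbrOfRows + (nbrOfRows - 1L - row0) + 1L
readMap <- order(writeMap)

# Simulate the rotated file, then recover the data in cell order
cellValues <- 10 * cells
fileData <- cellValues[readMap]    # file position j holds cell readMap[j]
unrotated <- fileData[writeMap]    # cell i is found at file position writeMap[i]
print(readMap)
print(identical(unrotated, cellValues))
```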
Henrik Bengtsson