gdsSubset: Write a subset of data in a GDS file to a new GDS file


gdsSubset    R Documentation

Write a subset of data in a GDS file to a new GDS file

Description

gdsSubset takes a subset of data (SNPs and samples) from a GDS file and writes it to a new GDS file. gdsSubsetCheck checks that a GDS file is the desired subset of another GDS file.

Usage

gdsSubset(parent.gds, sub.gds,
          sample.include=NULL, snp.include=NULL,
          sub.storage=NULL,
          compress="LZMA_RA",
          block.size=5000,
          verbose=TRUE,
          allow.fork=FALSE)

gdsSubsetCheck(parent.gds, sub.gds,
               sample.include=NULL, snp.include=NULL,
               sub.storage=NULL,
               verbose=TRUE,
               allow.fork=FALSE)

Arguments

parent.gds

Name of the parent GDS file

sub.gds

Name of the subset GDS file

sample.include

Vector of sampleIDs to include in sub.gds

snp.include

Vector of snpIDs to include in sub.gds

sub.storage

Storage type for the subset file; defaults to the storage type of the parent file.

compress

The compression level for variables in a GDS file (see add.gdsn for options).

block.size

For GDS files stored with scan,snp dimensions, the number of SNPs to read from the parent file at a time. Ignored for snp,scan dimensions.

verbose

Logical value specifying whether to show progress information.

allow.fork

Logical value specifying whether to enable multiple forks to access the GDS file simultaneously.

Details

gdsSubset can select a subset of SNPs for all samples by setting snp.include, a subset of samples for all SNPs by setting sample.include, or a subset of both SNPs and samples by setting both arguments. The GDS nodes "snp.id", "snp.position", "snp.chromosome", and "sample.id" are copied, as well as any 2-dimensional nodes, together with the attributes of those 2-dimensional nodes. Other nodes are not copied.

If sub.storage is specified, any 2-dimensional array in the subset GDS file is written with that storage mode instead of the parent's. In the special case where a 2-dimensional node has an attribute named "missing.value" and sub.storage is "bit2", the missing.value attribute of the subset node is automatically set to 3. gdsSubset itself does not currently check that the values will be stored properly with a different storage type, but gdsSubsetCheck will return an error if the values do not match.

If the nodes in the GDS file are stored with scan,snp dimensions, gdsSubset reads the parent file in blocks of block.size SNPs at a time. If the nodes are stored with snp,scan dimensions, the function simply loops over samples, one at a time.
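To make the block-wise reading for scan,snp storage more concrete, the following is a minimal base R sketch of the idea using an in-memory matrix in place of a GDS node. It is not the code gdsSubset actually runs; the helper snp.blocks and the toy objects parent, sample.keep, and snp.keep are hypothetical names introduced only for illustration.

```r
# Hypothetical helper (not part of GWASTools): split the SNP indices
# 1..n.snp into consecutive blocks of at most block.size indices.
snp.blocks <- function(n.snp, block.size = 5000) {
  starts <- seq(1, n.snp, by = block.size)
  ends <- pmin(starts + block.size - 1, n.snp)
  Map(function(s, e) s:e, starts, ends)
}

# Toy stand-in for a 2-dimensional node stored with scan,snp dimensions
# (rows are samples, columns are SNPs).
set.seed(1)
parent <- matrix(sample(0:2, 4 * 12, replace = TRUE), nrow = 4, ncol = 12)
sample.keep <- c(1, 3)          # row indices of samples to keep
snp.keep <- c(2, 5, 6, 11)      # column indices of SNPs to keep

sub <- matrix(NA_integer_, nrow = length(sample.keep), ncol = length(snp.keep))
for (block in snp.blocks(ncol(parent), block.size = 5)) {
  # Positions in the subset whose SNP falls inside this block
  in.block <- which(snp.keep %in% block)
  if (length(in.block) == 0) next
  # "Read" one block of SNP columns from the parent
  chunk <- parent[, block, drop = FALSE]
  # Locate the wanted SNPs within the block and copy the selected samples
  cols <- match(snp.keep[in.block], block)
  sub[, in.block] <- chunk[sample.keep, cols, drop = FALSE]
}

# The block-wise result matches direct indexing of the parent
stopifnot(identical(sub, parent[sample.keep, snp.keep, drop = FALSE]))
```

A smaller block.size keeps less of the parent in memory at once at the cost of more read operations, which is the trade-off the argument exposes.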

gdsSubsetCheck checks that a subset GDS file has the expected SNPs and samples of the parent file. It also checks that attributes were similarly copied, except for the above-mentioned special case of missing.value for sub.storage="bit2".
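As an illustration of the behavior described above, the sketch below subsets SNPs only (leaving sample.include as NULL so that all samples are kept), runs gdsSubsetCheck, and then reads the attributes of the 2-dimensional node from both files with gdsfmt for comparison. It assumes the GWASTools, gdsfmt, and GWASdata packages are installed and that the 2-dimensional node in the example file is named "genotype"; get.geno.attr is a hypothetical helper defined here, not a package function.

```r
library(GWASTools)
library(gdsfmt)

gdsfile <- system.file("extdata", "illumina_geno.gds", package="GWASdata")

# Pick the first 100 SNP IDs and record the number of samples in the parent
gds <- GdsGenotypeReader(gdsfile)
snp.sel <- getSnpID(gds, index=1:100)
n.scan.parent <- nscan(gds)
close(gds)

# Subset SNPs only: sample.include is left as NULL, so all samples are kept
subfile <- tempfile()
gdsSubset(gdsfile, subfile, snp.include=snp.sel, verbose=FALSE)
gdsSubsetCheck(gdsfile, subfile, snp.include=snp.sel, verbose=FALSE)

# Dimensions of the subset file
sub <- GdsGenotypeReader(subfile)
n.snp.sub <- nsnp(sub)
n.scan.sub <- nscan(sub)
close(sub)

# Hypothetical helper: return the attributes of the "genotype" node
# (node name is an assumption about this example file)
get.geno.attr <- function(path) {
  f <- openfn.gds(path)
  on.exit(closefn.gds(f))
  get.attr.gdsn(index.gdsn(f, "genotype"))
}
parent.attr <- get.geno.attr(gdsfile)
sub.attr <- get.geno.attr(subfile)

# Print both attribute lists to compare them by eye
str(parent.attr)
str(sub.attr)

file.remove(subfile)
```

Passing sub.storage="bit2" to both calls would exercise the missing.value special case; in that situation the printed missing.value attributes may legitimately differ between the two files.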

Author(s)

Adrienne Stilp

See Also

gdsfmt, createDataFile

Examples

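library(GWASTools)

# Example GDS file shipped with the GWASdata package
gdsfile <- system.file("extdata", "illumina_geno.gds", package="GWASdata")

# Select the first 10 sample IDs and the first 100 SNP IDs from the parent file
gds <- GdsGenotypeReader(gdsfile)
sample.sel <- getScanID(gds, index=1:10)
snp.sel <- getSnpID(gds, index=1:100)
close(gds)

# Write the subset to a temporary file, then verify it against the parent
subfile <- tempfile()
gdsSubset(gdsfile, subfile, sample.include=sample.sel, snp.include=snp.sel)
gdsSubsetCheck(gdsfile, subfile, sample.include=sample.sel, snp.include=snp.sel)

# Clean up the temporary file
file.remove(subfile)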

smgogarten/GWASTools documentation built on Nov. 10, 2024, 9:54 p.m.