compress: Compress phased genotype data

ghap.compress    R Documentation

Compress phased genotype data

Description

This function takes phased genotype data and converts them into a compressed binary format.

Usage

  ghap.compress(input.file = NULL, out.file,
                samples.file = NULL, markers.file = NULL,
                phase.file = NULL, batchsize = NULL,
                ncores = 1, verbose = TRUE)

Arguments

If all input files share the same prefix, the user can use the following shortcut options:

input.file

Prefix for input files.

out.file

Output file name.

For backward compatibility, the user can still point to input files separately:

samples.file

Individual information.

markers.file

Variant map information.

phase.file

Phased genotype matrix.

To turn progress messages on or off, or to control batching and parallelization of the task, use:

batchsize

A numeric value controlling the number of markers to be compressed and written to output at a time (default = nmarkers/10).

ncores

A numeric value specifying the number of cores to be used in parallel computing (default = 1).

verbose

A logical value specifying whether log messages should be printed (default = TRUE).

Details

The supported input format is composed of three files with suffix:

  • .samples: space-delimited file without header containing two mandatory columns: Population and ID. Note that the Population column serves solely to group samples, so the user can define any arbitrary family/cluster/subgroup and use it as a "population" tag. This file may also contain three optional columns: Sire, Dam and Sex (coded 1 = M and 2 = F). Values "0" and "NA" in these optional columns are treated as missing values.

  • .markers: space-delimited file without header containing five mandatory columns: Chromosome, Marker, Position (in bp), Reference Allele (A0) and Alternative Allele (A1). Markers should be sorted by chromosome and position. Repeated positions are tolerated, but the user is warned of their presence in the data. Optionally, the user may provide a file containing an additional column with genetic positions (in cM), which has to be placed between the base pair position and the reference allele columns.

  • .phase: space-delimited file without header containing the phased genotype matrix. The dimension of the matrix is expected to be m x 2n, where m is the number of markers and n is the number of individuals (i.e., two columns per individual, representing the two phased chromosome alleles). Alleles must be coded as 0 or 1. No missing values are allowed, since imputation is assumed to be part of the phasing procedure.
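
As an illustration of the layout described above, the following sketch writes a toy data set (2 individuals, 3 markers) using base R only and checks it against the stated requirements. All sample names, marker names, positions and alleles are invented for the example, and the file prefix "toy" is hypothetical; this is not part of the package.

```r
# Toy data: every value below is invented for illustration
samples <- data.frame(pop = c("POP1", "POP1"),
                      id  = c("IND1", "IND2"))
markers <- data.frame(chr    = c(1, 1, 1),
                      marker = c("M1", "M2", "M3"),
                      bp     = c(1000, 2000, 3000),
                      a0     = c("A", "C", "G"),
                      a1     = c("G", "T", "A"))
# m x 2n matrix: one row per marker, two columns per individual
phase <- matrix(c(0, 1, 1, 0,
                  1, 1, 0, 0,
                  0, 0, 0, 1),
                nrow = 3, byrow = TRUE)

# Write the three space-delimited files without headers
prefix <- file.path(tempdir(), "toy")
write.table(samples, paste0(prefix, ".samples"),
            quote = FALSE, row.names = FALSE, col.names = FALSE)
write.table(markers, paste0(prefix, ".markers"),
            quote = FALSE, row.names = FALSE, col.names = FALSE)
write.table(phase, paste0(prefix, ".phase"),
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Read the phase file back and check the documented constraints
phase.in <- as.matrix(read.table(paste0(prefix, ".phase")))
stopifnot(nrow(phase.in) == nrow(markers),      # m rows
          ncol(phase.in) == 2 * nrow(samples),  # 2n columns
          all(phase.in %in% c(0, 1)),           # alleles coded 0/1
          !anyNA(phase.in))                     # no missing values
# Markers should be sorted by chromosome and position
markers.sorted <- !is.unsorted(order(markers$chr, markers$bp))
stopifnot(markers.sorted)
```

With files like these in place, passing the shared prefix as input.file should be enough for ghap.compress to locate all three inputs.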

The function outputs a binary file with suffix .phaseb. Each allele is stored as a bit in that file. Bits for any given marker are arranged in a sequence of bytes. Since each marker requires storage of 2*nsamples bits, the number of bytes consumed by a single marker in the output file is ceiling(2*nsamples/8). If the number of alleles is not a multiple of 8, bits in the remainder of the last byte are filled with 0. All functions in GHap were carefully designed to decode the bytes of a marker in such a way that trailing bits are ignored if present.
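
The byte arithmetic can be sketched in base R for a single marker. This is only an illustration of the idea (one bit per allele, 2*nsamples bits rounded up to whole bytes, zero padding at the end), not the package's own code. In particular, the bit order within each byte used here is the one imposed by packBits (first element is the least significant bit), which is an assumption and may differ from the order GHap actually writes.

```r
# One marker, three individuals: 2 * nsamples = 6 alleles (invented values)
nsamples <- 3
alleles  <- c(0L, 1L, 1L, 0L, 1L, 1L)

# Bytes needed for this marker: bits rounded up to whole bytes
nbytes <- ceiling(2 * nsamples / 8)

# Pad the remainder of the last byte with 0 bits
padded <- c(alleles, rep(0L, nbytes * 8 - length(alleles)))

# Pack bits into raw bytes (bit order is an assumption, see lead-in)
bytes <- packBits(padded, type = "raw")

# Decode: unpack all bits, then drop the trailing padding bits
decoded <- as.integer(rawToBits(bytes))[seq_len(2 * nsamples)]
stopifnot(length(bytes) == nbytes,
          identical(decoded, alleles))
```

Here six alleles fit in a single byte, and the two trailing padding bits are discarded on decoding.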

Author(s)

Yuri Tani Utsunomiya <ytutsunomiya@gmail.com>

Examples

 
# #### DO NOT RUN IF NOT NECESSARY ###
# 
# # Copy the example data in the current working directory
# exfiles <- ghap.makefile(dataset = "example",
#                          format = "raw",
#                          verbose = TRUE)
# file.copy(from = exfiles, to = "./")
# 
# ### RUN ###
# 
# # Compress phase data using prefix
# ghap.compress(input.file = "example",
#               out.file = "example")
# 
# # Compress phase data using file names
# ghap.compress(samples.file = "example.samples",
#               markers.file = "example.markers",
#               phase.file = "example.phase",
#               out.file = "example")


GHap documentation built on July 2, 2022, 1:07 a.m.