read_bed: Read a genotype matrix in Plink BED format

read_bed    R Documentation

Read a genotype matrix in Plink BED format

Description

This function reads genotypes encoded in a Plink-formatted BED (binary) file and returns them as a standard R matrix of numeric dosages (values in c(0, 1, 2, NA)). Each entry, for one locus (m loci in total) and one individual (n in total), counts the number of reference alleles, or is NA for missing data. This basic function does not read any *.fam or *.bim files. Since the BED format does not encode the data dimensions internally, the user must provide them.

Usage

read_bed(
  file,
  names_loci = NULL,
  names_ind = NULL,
  m_loci = NA,
  n_ind = NA,
  ext = "bed",
  verbose = TRUE
)

Arguments

file

Input file path. The *.bed extension may be omitted (it is added automatically if file doesn't exist but file.bed does). See the ext option below.

names_loci

Vector of loci names, to become the row names of the genotype matrix. If provided, its length sets m_loci below. If NULL, the returned genotype matrix will not have row names, and m_loci must be provided.

names_ind

Vector of individual names, to become the column names of the genotype matrix. If provided, its length sets n_ind below. If NULL, the returned genotype matrix will not have column names, and n_ind must be provided.

m_loci

Number of loci in the input genotype table. Required if names_loci = NULL, as its value is not deducible from the BED file itself. Ignored if names_loci is provided.

n_ind

Number of individuals in the input genotype table. Required if names_ind = NULL, as its value is not deducible from the BED file itself. Ignored if names_ind is provided.

ext

The desired file extension (default "bed"). Ignored if file points to an existing file. Set to NA to force file to exist as-is.

verbose

If TRUE (default), the function reports the path of the file being read (after autocompleting the extension).

Details

The code performs several checks to validate the data against the requested dimensions. An error is thrown if the file ends too early, or if it does not end once the genotype matrix has been filled. In addition, each locus is encoded in an integer number of bytes and each byte holds up to four individuals, so the last byte of a locus is padded when it holds fewer than four. To agree with other software (plink2, BEDMatrix), the padding values are ignored (they may take on any value without causing errors).
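
The file-length check follows directly from this layout. As a rough sketch in base R (the helper names below are hypothetical and not part of genio), the expected file size is three magic-number bytes plus ceiling(n / 4) bytes for each of the m loci, and every locus ends in padding whenever n is not a multiple of four:

```r
# Hypothetical helpers, not part of genio: sizes implied by a locus-major BED layout

bed_bytes_per_locus <- function(n_ind) {
  # each byte holds up to four individuals (two bits each)
  ceiling(n_ind / 4)
}

bed_expected_size <- function(m_loci, n_ind) {
  # three magic-number bytes, then one block of bytes per locus
  3 + m_loci * bed_bytes_per_locus(n_ind)
}

bed_padding_slots <- function(n_ind) {
  # unused two-bit slots in the last byte of each locus
  4 * bed_bytes_per_locus(n_ind) - n_ind
}

# toy dimensions, assumed for illustration only
m_loci <- 10
n_ind <- 5
bed_bytes_per_locus(n_ind)        # 2
bed_expected_size(m_loci, n_ind)  # 23
bed_padding_slots(n_ind)          # 3
```

For a real file, the result of bed_expected_size() can be compared against file.size() of the BED file before reading it.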

This function only supports locus-major BED files, which are the standard for modern data. The format is validated via the BED file's magic numbers (the first three bytes of the file). Older BED files can be converted using Plink.
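
To illustrate the header check, here is a minimal base R sketch. The helper is_locus_major_bed is hypothetical and is not how genio necessarily implements its validation; it only relies on the Plink BED header, in which the bytes 0x6c 0x1b identify the format and a third byte of 0x01 marks locus-major order:

```r
# Hypothetical helper, not part of genio: check the three-byte BED header
is_locus_major_bed <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  header <- readBin(con, what = "raw", n = 3)
  # 0x6c 0x1b identify a Plink BED file; 0x01 marks locus-major order
  identical(header, as.raw(c(0x6c, 0x1b, 0x01)))
}

# toy file: valid locus-major header followed by one arbitrary genotype byte
tmp <- tempfile(fileext = ".bed")
writeBin(as.raw(c(0x6c, 0x1b, 0x01, 0x00)), tmp)
is_locus_major_bed(tmp)

# toy file: third byte 0x00, as in older individual-major files
tmp_old <- tempfile(fileext = ".bed")
writeBin(as.raw(c(0x6c, 0x1b, 0x00, 0x00)), tmp_old)
is_locus_major_bed(tmp_old)
```

The first call returns TRUE and the second FALSE, which is the case where the file would need to be converted with Plink first.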

Value

The m-by-n genotype matrix.
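
As a short sketch of the dimensions-only interface, the call below passes m_loci and n_ind instead of name vectors. It assumes the genio package and its bundled sample files are installed, and takes the dimensions from the row counts of the BIM and FAM tables (one row per locus and one row per individual, respectively):

```r
library(genio)

# sample files shipped with genio
file_bed <- system.file("extdata", "sample.bed", package = "genio", mustWork = TRUE)
file_bim <- system.file("extdata", "sample.bim", package = "genio", mustWork = TRUE)
file_fam <- system.file("extdata", "sample.fam", package = "genio", mustWork = TRUE)

# dimensions from the annotation tables
m_loci <- nrow(read_bim(file_bim))
n_ind <- nrow(read_fam(file_fam))

# no names are passed, so both dimensions must be given explicitly
X <- read_bed(file_bed, m_loci = m_loci, n_ind = n_ind, verbose = FALSE)
dim(X)       # m_loci rows, n_ind columns
rownames(X)  # NULL, since names_loci was not provided
colnames(X)  # NULL, since names_ind was not provided
```

Passing names_loci and names_ind instead yields the same matrix with row and column names attached.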

See Also

read_plink() for reading a set of BED/BIM/FAM files.

geno_to_char() for translating numerical genotypes into more human-readable character encodings.

Plink BED format reference: https://www.cog-genomics.org/plink/1.9/formats#bed

Examples

# first obtain data dimensions from BIM and FAM files
# all file paths
file_bed <- system.file("extdata", 'sample.bed', package = "genio", mustWork = TRUE)
file_bim <- system.file("extdata", 'sample.bim', package = "genio", mustWork = TRUE)
file_fam <- system.file("extdata", 'sample.fam', package = "genio", mustWork = TRUE)
# read annotation tables
bim <- read_bim(file_bim)
fam <- read_fam(file_fam)

# read an existing Plink *.bed file
# pass locus and individual IDs as vectors, setting data dimensions too
X <- read_bed(file_bed, bim$id, fam$id)
X

# can specify without extension
file_bed <- sub('\\.bed$', '', file_bed) # remove extension from this path on purpose
file_bed # verify .bed is missing
X <- read_bed(file_bed, bim$id, fam$id) # loads too!
X


genio documentation built on Jan. 7, 2023, 1:12 a.m.