convert_plink: Converts PLINK binary format to SNP formatted file.
In stefanedwards/Siccuracy: Pipeline Package for AlphaImpute


Converts PLINK binary format to SNP formatted file.

convert_plink(bfile, outfn, na = 9, newID = 0, nlines = NULL,
  fam = NULL, bim = NULL, bed = NULL, countminor = TRUE, maf = 0,
  chr = NULL, extract = NULL, exclude = NULL, extract_chr = NULL,
  keep = NULL, remove = NULL, method = "simple", fragments = "chr",
  remerge = TRUE, fragmentfns = NULL)

`bfile`	Filename of PLINK binary files, i.e. without extension.
`outfn`	Filename of new file.
`na`	Missing value.
`newID`	Integer scalar (default `0`) for automatically assigning new IDs. See description for more.
`nlines`	Number of lines to process.
`fam`	If binary files have different stems, specify each of them with `fam`, `bim`, `bed`, and set `bfile=NULL`.
`bim`	See `fam`.
`bed`	See `fam`.
`countminor`	Logical: Should the output count minor allele (default), or major allele as `plink --recode A`.
`maf`	Numeric: restrict output to SNPs with this minor allele frequency.
`chr`	Vector of chromosomes to limit output to.
`extract`	Extract only these SNPs, see Details.
`exclude`	Do not extract these SNPs, see Details.
`extract_chr`	Extract only these chromosomes, see Details.
`keep`	Keep only these samples, see Details.
`remove`	Removes these samples from output, see Details.
`method`	Character, which of the following methods to use: `simple`, `lowmem`, or `dryrun`. See Details.
`fragments`	`"chr"` or integer vector. Only used when `method='lowmem'`.
`remerge`	Logical, whether to re-merge fragmented blocks. Only used when `method='lowmem'`.
`fragmentfns`	Character vector or function for producing filenames.
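
As an illustration of how the arguments above fit together, the following is a minimal sketch of a call. The wrapper name convert_if_available, the file stem "mydata" and the output name are hypothetical; only the argument names are taken from the usage shown above. The guard makes the snippet safe to run when the package or the input files are missing.

```r
# Hypothetical wrapper: returns FALSE (without error) when the Siccuracy
# package or the three PLINK files are unavailable, otherwise converts.
convert_if_available <- function(stem, outfn) {
  have_pkg <- requireNamespace("Siccuracy", quietly = TRUE)
  have_files <- all(file.exists(paste0(stem, c(".bed", ".bim", ".fam"))))
  if (!have_pkg || !have_files) return(FALSE)
  # Argument names as listed in the usage; values are example choices.
  Siccuracy::convert_plink(
    bfile = stem, outfn = outfn, na = 9, extract_chr = 1:5
  )
  TRUE
}

# "mydata" stands for mydata.bed, mydata.bim and mydata.fam (assumed names).
convert_if_available("mydata", "mydata_snps.txt")
```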

Method simple stores the entire genotype matrix in memory. PLINK binary files are stored in locus-major mode, i.e. the first block of bits stores the first locus for all n animals (two bits per genotype). Since we are interested in writing out all m loci for each animal, it is most efficient to read the entire file. Method lowmem breaks the loci into smaller chunks (e.g. by chromosome), writes each chunk to a file, and merges them back as with cbind_SNPs. Method dryrun does not call the Fortran subroutine, but returns the processed arguments that would have been sent to the subroutine.
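
The locus-major layout can be illustrated with a small base R decoder. This is a sketch for illustration only, not the package's Fortran routine: the name decode_bed is hypothetical, and mapping the two-bit codes to counts of the first allele in the .bim file is an assumption about the desired output. It follows the PLINK 1 .bed layout (three header bytes, then ceiling(n/4) bytes per locus, lowest-order bits first).

```r
# Decode an in-memory PLINK .bed byte vector into an n x m matrix
# (rows = animals, columns = loci). Hypothetical helper for illustration.
decode_bed <- function(bytes, n, m, na = 9L) {
  stopifnot(is.raw(bytes), length(bytes) >= 3L)
  # Magic number 0x6c 0x1b, then 0x01 for locus-major mode.
  stopifnot(identical(bytes[1:3], as.raw(c(0x6c, 0x1b, 0x01))))
  bytes_per_locus <- ceiling(n / 4)
  stopifnot(length(bytes) == 3L + bytes_per_locus * m)
  # Codes 0..3 are the bit pairs 00, 01, 10, 11: homozygous first allele,
  # missing, heterozygous, homozygous second allele.
  lookup <- c(2L, as.integer(na), 1L, 0L)
  geno <- matrix(NA_integer_, nrow = n, ncol = m)
  for (j in seq_len(m)) {
    start <- 3L + (j - 1L) * bytes_per_locus
    block <- as.integer(bytes[start + seq_len(bytes_per_locus)])
    # Split each byte into four 2-bit codes, lowest-order pair first.
    codes <- as.vector(rbind(block %% 4L, (block %/% 4L) %% 4L,
                             (block %/% 16L) %% 4L, (block %/% 64L) %% 4L))
    geno[, j] <- lookup[codes[seq_len(n)] + 1L]
  }
  geno
}

# Five animals, two loci: each locus occupies two bytes after the header.
bed <- as.raw(c(0x6c, 0x1b, 0x01, 0x78, 0x00, 0x2f, 0x02))
g <- decode_bed(bed, n = 5, m = 2)
```

Each row of g (one animal) draws on every locus block, which is why writing animal by animal requires access to the whole file.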

For method='lowmem', use the argument fragments to indicate how the loci are subdivided. When fragments='chr' (case-insensitive), loci are split according to the 1st column of the .bim file. If fragments is a scalar integer, loci are split into this number of blocks. If it is an integer vector of the same length as the number of loci (ncol), it directly specifies which block each locus is sent to; max(fragments) then specifies the number of blocks.
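
The three forms of the fragment specification can be sketched in base R as follows. The helper assign_blocks is hypothetical, and splitting a scalar number of blocks into contiguous, roughly equal runs of loci is an assumption; the package may divide the loci differently.

```r
# Map a fragment specification to one block index per locus.
# `chr` is the 1st column of the .bim file (one entry per locus).
assign_blocks <- function(fragments, chr) {
  ncol <- length(chr)
  if (is.character(fragments) && tolower(fragments[1]) == "chr") {
    # One block per chromosome, numbered in order of first appearance.
    return(match(chr, unique(chr)))
  }
  if (length(fragments) == 1L) {
    # Assumed scheme: contiguous blocks of roughly equal size.
    k <- as.integer(fragments)
    return((seq_len(ncol) * k - 1L) %/% ncol + 1L)
  }
  # Explicit assignment: one block index per locus.
  stopifnot(length(fragments) == ncol)
  as.integer(fragments)
}

chr <- c(1, 1, 2, 2, 2, 5, 5)
by_chr <- assign_blocks("CHR", chr)   # case-insensitive match on "chr"
by_count <- assign_blocks(3L, chr)    # three contiguous blocks
```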

New integer IDs can be supplied; if they are not, they are generated automatically. If newID is an integer vector, it is used as is. If it is a data.frame with columns famID, sampID, and newID, its rows are reordered to match the input file.
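
The reordering of a data.frame of IDs can be sketched in base R. The helper match_new_ids is hypothetical and only illustrates the idea of aligning newID with the sample order of the .fam file; it is not the package's implementation.

```r
# Return newID in the sample order of `fam`, matching on famID + sampID.
match_new_ids <- function(fam, ids) {
  # Join both ID columns with a separator that should not occur in IDs.
  key <- function(d) paste(d$famID, d$sampID, sep = "\r")
  idx <- match(key(fam), key(ids))
  if (anyNA(idx)) stop("some samples have no newID")
  ids$newID[idx]
}

# Sample order as in the .fam file (1st and 2nd column).
fam <- data.frame(famID = c("F1", "F1", "F2"), sampID = c("a", "b", "a"))
# Supplied IDs in a different order.
ids <- data.frame(famID = c("F2", "F1", "F1"), sampID = c("a", "b", "a"),
                  newID = c(30L, 20L, 10L))
new_ids <- match_new_ids(fam, ids)
```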

Filters on loci or samples can be employed in a number of ways; filtering on loci and filtering on samples are handled independently. Inclusion criteria (extract and keep) reduce the output to only those loci or samples that pass the criteria. Exclusion criteria (exclude and remove) are applied after the inclusion criteria, and reduce the output further.

extract and exclude can be any combination of:

Logical: Vector of same length as loci in input file.
Integer or numeric: Positional indices of the loci to include or exclude. Numeric vectors are coerced to integer vectors.
Character: Matched against probe IDs, i.e. 2nd column of .bim file.

To restrict the output to certain chromosomes, use extract_chr. The output is the intersection of extract and extract_chr.
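
The order of the locus filters (inclusion, then chromosome restriction, then exclusion) can be sketched in base R. The helpers as_mask and select_loci are hypothetical, as are the column names chr and snp for the 1st and 2nd column of the .bim file; the sketch handles one filter type per argument rather than mixed combinations.

```r
# Turn a logical, numeric or character filter into a logical mask.
as_mask <- function(spec, ids, default) {
  n <- length(ids)
  if (is.null(spec)) return(rep(default, n))
  if (is.logical(spec)) {
    stopifnot(length(spec) == n)
    return(spec)
  }
  if (is.numeric(spec)) return(seq_len(n) %in% as.integer(spec))
  if (is.character(spec)) return(ids %in% spec)
  stop("unsupported filter type")
}

# TRUE for each locus that ends up in the output.
select_loci <- function(bim, extract = NULL, exclude = NULL,
                        extract_chr = NULL) {
  keep <- as_mask(extract, bim$snp, default = TRUE)
  # Intersection of extract and extract_chr.
  if (!is.null(extract_chr)) keep <- keep & bim$chr %in% extract_chr
  # Exclusion is applied after inclusion.
  keep & !as_mask(exclude, bim$snp, default = FALSE)
}

bim <- data.frame(chr = c(1, 1, 2, 2), snp = c("rs1", "rs2", "rs3", "rs4"))
sel <- select_loci(bim, extract = c("rs1", "rs2", "rs3"),
                   exclude = 2, extract_chr = 1)
```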

keep and remove accept the same types as extract and exclude above, in any combination, and can additionally be:

Character: Matched against either famID or sampID, i.e. 1st and 2nd column of .fam file.
List with named elements famID and/or sampID: The named elements are matched against, respectively, the 1st and 2nd column of the .fam file.
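
The additional sample filters can be sketched in base R as well. The helper select_samples is hypothetical. One assumption deserves a loud label: when a list names both famID and sampID, this sketch requires a sample to match both elements, which the text above does not specify.

```r
# TRUE for each sample (row of the .fam file) that ends up in the output.
select_samples <- function(fam, keep = NULL, remove = NULL) {
  match_samples <- function(spec, default) {
    n <- nrow(fam)
    if (is.null(spec)) return(rep(default, n))
    if (is.list(spec)) {
      # Assumption: named elements are combined with AND.
      hit <- rep(TRUE, n)
      if (!is.null(spec$famID)) hit <- hit & fam$famID %in% spec$famID
      if (!is.null(spec$sampID)) hit <- hit & fam$sampID %in% spec$sampID
      return(hit)
    }
    # A plain character vector matches either ID column.
    if (is.character(spec)) return(fam$famID %in% spec | fam$sampID %in% spec)
    if (is.logical(spec)) return(spec)
    seq_len(n) %in% as.integer(spec)
  }
  # Exclusion (remove) is applied after inclusion (keep).
  match_samples(keep, TRUE) & !match_samples(remove, FALSE)
}

fam <- data.frame(famID = c("F1", "F1", "F2", "F2"),
                  sampID = c("a", "b", "a", "c"))
sel <- select_samples(fam, keep = list(famID = "F2"),
                      remove = list(famID = "F2", sampID = "c"))
```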

The argument fragmentfns is used for method 'lowmem', providing filenames (absolute or relative) for producing the final converted files and intermediate .bim files. When remerge=TRUE, the argument outfn is ignored.

fragmentfns defaults to temporary files created with tempfile. If it is a character vector, the first $n_f$ elements are filenames for the $n_f$ fragments (e.g. chromosomes), and elements $n_f + 1, \ldots, 2 n_f$ are filenames for the intermediate .bim files. The vector is automatically padded with temporary files to the required length.
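
The padding of a character vector to $2 n_f$ filenames can be sketched in base R. The helper pad_fragmentfns and the "fragment" prefix of the temporary files are hypothetical; the sketch only covers the character-vector case, not a function passed as fragmentfns.

```r
# Pad a (possibly empty) vector of filenames to 2 * n_f entries:
# n_f fragment files followed by n_f intermediate .bim files.
pad_fragmentfns <- function(fragmentfns = NULL, n_f) {
  need <- 2L * as.integer(n_f)
  fns <- as.character(fragmentfns)
  if (length(fns) < need) {
    # tempfile() is vectorised over `pattern`: one name per element.
    fns <- c(fns, tempfile(rep("fragment", need - length(fns))))
  }
  fns <- fns[seq_len(need)]
  list(fragments = fns[seq_len(n_f)], bim = fns[n_f + seq_len(n_f)])
}

# Two user-supplied fragment files; the .bim files fall back to tempfiles.
fns <- pad_fragmentfns(c("chr1.txt", "chr2.txt"), n_f = 2)
```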

If fragmentfns is a function, it will be called with 0, 1, or 2 arguments. The first argument is a running number for the fragments, the second is the maximum number of fragments.

PLINK v. 1.07 BED file format: https://www.cog-genomics.org/plink/1.9/formats#bed
Shaun Purcell and Christopher Chang. PLINK v. 1.90 https://www.cog-genomics.org/plink2
Chang CC, Chow CC, Tellier LCAM, Vattikuti S, Purcell SM, Lee JJ (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4. doi: 10.1186/s13742-015-0047-8

convert_plink is a direct conversion that does not rely on PLINK. See the alternate convert_plinkA which re-formats the output from plink --recode A.

stefanedwards/Siccuracy documentation built on May 30, 2019, 10:44 a.m.