Nothing
#' @title Read marker data.
#'
#' @description
#' A function for reading in marker data. Two types of data can be read.
#' @param filename contains the name of the marker file. The file name needs to be in quotes.
#' @param type specify the type of file. Choices are 'text' (the default) and PLINK.
#' @param missing the number or character for a missing genotype in the text file. There is no need to specify this for a PLINK ped file. Missing
#' allele values in a PLINK file must be coded as '0' or '-'.
#' @param AA the character or number corresponding to the 'AA' snp genotype in the marker genotype file.
#' This need only be specified if the file type is 'text'. If a character then it must be in quotes.
#' @param AB the character or number corresponding to the 'AB' snp genotype in the marker genotype file.
#' This need only be specified if the file type is 'text'.
#'This can be left unspecified
#' if there are no heterozygous genotypes (i.e. the individuals are inbred). Only a single
#' heterozygous genotype is allowed ('Eagle' does not distinguish between 'AB' and 'BA').
#' If specified and a character, it must be in quotes.
#' @param BB the character or number corresponding to the 'BB' snp genotype in the marker genotype file.
#' This need only be specified if the file type is 'text'. If a character, then it must be in quotes.
#' @param availmemGb a numeric value. It specifies the amount of available memory (in Gigabytes).
#' This should be set to be as large as possible for best performance.
#' @param quiet a logical value. If set to \code{TRUE}, additional runtime output is printed.
#'
#' @details
#'
#' \code{ReadMarker} can handle two different types of marker data; namely,
#' genotype data in a plain text file, and PLINK ped files.
#'
#' \subsection{\strong{Reading in a plain text file containing the marker genotypes}}{
#' To load a text file that contains snp genotypes, run \code{ReadMarker} with \code{filename} set to the name of the file,
#' and \code{AA}, \code{AB}, \code{BB} set to the corresponding genotype values.
#' The genotype values in the text file can be numeric, character, or a mix of both.
#'
#' We make the following assumptions
#' \itemize{
#' \item{The text file does not contain row or column headings}
#' \item{The file is allowed to contain missing genotypes that have been coded according to \code{missing}}
#' \item{Individuals are diploid}
#' \item{The rows of the text file are the individuals and the columns are the marker loci}
#' \item{The file is space separated}
#' \item{The mapping of the observed genotypes in the marker file to \code{AA}, \code{AB}, and \code{BB}, remains the same for all loci}
#' \item{Individuals are outbred when \code{AA}, \code{AB}, and \code{BB} are specified and
#' inbred when only \code{AA}, and \code{BB} are specified}
#' \item{For a text file, the same alphanumeric value is used for all missing marker genotypes. For a PLINK ped file, the missing allele is allowed to
#' be '0' or '-'.}
#'}
#'
#' For example, suppose we have a space separated text file with marker genotype data collected from five snp loci on three individuals
#' where the snp genotype AA has been coded 0, the snp genotype AB has been coded 1, the snp genotype BB has been coded 2, and missing genotypes are coded
#' as 99
#' \tabular{ccccc}{
#' 0 \tab 1 \tab 2 \tab 0\tab 2 \cr
#' 1 \tab 1 \tab 0 \tab 2 \tab 0 \cr
#' 2 \tab 2 \tab 1 \tab 1 \tab 99
#'}
#' The file is called geno.txt and is located in the directory /my/dir/.
#'
#'
#' To load these data, we would use the command
#'
#' \preformatted{geno_obj <- ReadMarker(filename='/my/dir/geno.txt', AA=0, AB=1, BB=2, type='text', missing=99)}
#'
#' where the results from running the function are placed in \code{geno_obj}.
#'
#' As another example, suppose we have a space separated text file with marker genotype data collected from five snp loci on three individuals
#' where the snp genotype AA has been coded a/a, the snp genotype AB has been coded a/b, and the snp genotype BB has been coded b/b
#' \tabular{c}{
#' a/a a/b b/b a/a b/b \cr
#' a/b a/b a/a b/b a/a \cr
#' b/b b/b a/b a/b NA
#'}
#' The file is called geno.txt and is located in the same directory from which R is being run (i.e. the working directory).
#'
#' To load these data, we would use the command
#'
#' \preformatted{geno_obj <- ReadMarker(filename='geno.txt', AA='a/a', AB='a/b', BB='b/b',
#' type='text', missing = 'NA')}
#'
#' where the results from running the function are placed in \code{geno_obj}.
#'}
#'
#' \subsection{\strong{Reading in a PLINK ped file}}{
#' PLINK is a well known toolkit for the analysis of genome-wide association data. See
#' \url{https://www.cog-genomics.org/plink2}
#' for details.
#'
#' Full details of PLINK ped files can be found \url{https://www.cog-genomics.org/plink/1.9/formats#ped}. Briefly,
#' the PED file is a space delimited file (tabs are not allowed): the first six columns are mandatory:
#'
#' \tabular{l}{
#' Family ID \cr
#' Individual ID \cr
#' Paternal ID \cr
#' Maternal ID \cr
#' Sex (1=male; 2=female; other=unknown) \cr
#' Phenotype
#'}
#'
#'
#' Here, these columns can be any values since \code{ReadMarker} ignores these columns.
#'
#' Genotypes (column 7 onwards) can be any character
#' (e.g. 1,2,3,4 or A,C,G,T or anything else) except 0 which is, by default, the missing genotype character. All markers should be biallelic.
#' All snps must have two alleles specified. Missing alleles (i.e 0 or -) are allowed.
#' No column headings should be given.
#'
#' As an example, suppose we have data on three individuals genotyped for four snp loci
#' \tabular{cccccccccccccc}{
#' FAM001 \tab 101 \tab 0 \tab 0 \tab 1 \tab 0 \tab A \tab G \tab C \tab C \tab C \tab G \tab A \tab A \cr
#' FAM001 \tab 201 \tab 0 \tab 0 \tab 2 \tab 0 \tab A \tab A \tab C \tab T \tab G \tab G \tab T \tab A \cr
#' FAM001 \tab 300 \tab 101 \tab 201 \tab 2 \tab 0 \tab G \tab A \tab T \tab T \tab C \tab G \tab A \tab T
#'}
#'
#' Then to load these data, we would use the command
#'
#' \preformatted{geno_obj <- ReadMarker(filename='PLINK.ped', type='PLINK')}
#'
#' where \code{geno_obj} is used by \code{\link{AM}}, and the file PLINK.ped is located in the working directory (i.e. the directory from which R
#' is being run).
#'}
#'
#'
#' \subsection{Reading in other formats}{
#' Having first installed the stand-alone PLINK software, it is possible to convert other file formats into PLINK ped files. See \url{https://www.cog-genomics.org/plink/1.9/formats} for details.
#'
#' For example, to convert vcf file into a PLINK ped file, at the unix prompt, use the PLINK command
#'
#' \preformatted{PLINK --vcf filename.vcf --recode --out newfilename}
#'
#' and to convert a binary ped file (bed) into a ped file, use the PLINK command
#'
#' \preformatted{PLINK --bfile filename --recode --tab --out newfilename}
#'
#'}
#'
#'
#'
#' @return To allow \code{\link{AM}} to handle data larger than the memory capacity of a machine, \code{ReadMarker} doesn't load
#' the marker data into memory. Instead, it creates a reformatted file of the marker data and its transpose. The object returned by
#' \code{ReadMarker} is a list object with the elements \code{asciifileM} , \code{asciifileMt}, and \code{dim_of_ascii_M}
#' which is the full file name (name and path) of the reformatted file for the marker data, the full file name of the reformatted file
#' for the transpose of the marker data, and a 2 element vector with the first element the number of individuals and the second
#' element the number of marker loci.
#'
#'
#' @examples
#' #--------------------------------
#' # Example 1
#' #-------------------------------
#' #
#' # Read in the genotype data contained in the text file geno.txt
#' #
#' # The function system.file() gives the full file name (name + full path).
#' complete.name <- system.file('extdata', 'geno.txt', package='Eagle')
#' #
#' # The full path and name of the file is
#' print(complete.name)
#'
#' # Here, 0 values are being treated as genotype AA,
#' # 1 values are being treated as genotype AB,
#' # and 2 values are being treated as genotype BB.
#' # 4 gigabytes of memory has been specified.
#' # The file is space separated with the rows the individuals
#' # and the columns the snp loci.
#' geno_obj <- ReadMarker(filename=complete.name, type='text', AA=0, AB=1, BB=2, availmemGb=4)
#'
#' # view list contents of geno_obj
#' print(geno_obj)
#'
#' #--------------------------------
#' # Example 2
#' #-------------------------------
#' #
#' # Read in the allelic data contained in the PLINK ped file geno.ped
#' #
#' # The function system.file() gives the full file name (name + full path).
#' complete.name <- system.file('extdata', 'geno.ped', package='Eagle')
#'
#' #
#' # The full path and name of the file is
#' print(complete.name)
#'
#' # Here, the first 6 columns are being ignored and the allelic
#' # information in columns 7 - 10002 is being converted into a reformatted file.
#' # 4 gigabytes of memory has been specified.
#' # The file is space separated with the rows the individuals
#' # and the columns the snp loci.
#' geno_obj <- ReadMarker(filename=complete.name, type='PLINK', availmemGb=4)
#'
#' # view list contents of geno_obj
#' print(geno_obj)
#'
#'
ReadMarker <- function( filename=NULL, type='text', missing=NULL,
AA=NULL, AB=NULL, BB=NULL,
availmemGb=16,
quiet=TRUE){
if (nargs() == 0){
## checking that function has arguments
message(" Please supply arguments to function \n")
return(NULL)
} else {
## read in either a text file or a PLINK file. The parameter type must be specified. Default it text file.
if(is.null(type)){
message(" type must be set to \"text\" or \"PLINK\". \n")
message(" ReadMarker has terminated with errors.")
return(NULL)
}
if(!(type=="text" || type=="PLINK") ){
message(" type must be set to \"text\" or \"PLINK\". \n")
message(" ReadMarker has terminated with errors")
return(NULL)
}
## ------ PLINK ped file -------------------
if (type=="PLINK"){
## checking if a PLINK file has been specified.
if (is.null(filename)){
message(" The name of the PLINK ped file is missing.")
message(" ReadMarker has terminated with errors.")
return(NULL)
}
if (!file.exists(fullpath(filename) )){
message(" The PLINK ped file ", filename, " could not be found. ")
message(" ReadMarker has terminated with errors ")
return(NULL)
}
## Rcpp function to get dimensions of PLINK ped file
dim_of_ascii_M <- getRowColumn(fname=fullpath(filename))
dim_of_ascii_M[2] <- (dim_of_ascii_M[2] - 6)/2 ## adjusting for PLINK structure
## Rcpp function to create binary packed M and Mt file
it_worked <- create.ascii(file_genotype=fullpath(filename), type=type, availmemGb=availmemGb, dim_of_ascii_M=dim_of_ascii_M, quiet=quiet )
if(!it_worked)
return(NULL)
if(.Platform$OS.type == "unix") {
asciifileM <- paste(tempdir(), "/", "M.ascii", sep="")
asciifileMt <- paste(tempdir(), "/", "Mt.ascii", sep="")
} else {
asciifileM <- paste(tempdir() , "\\", "M.ascii", sep="")
asciifileMt <- paste(tempdir() , "\\", "Mt.ascii", sep="")
}
} else {
## ------------ text file -----------------------
## Assuming a text file that may be comma separated with numeric genotypes that need to be mapped onto AA, AB, and BB.
## check of parameters
error.code <- check.inputs(file_genotype=filename, availmemGb=availmemGb)
if(error.code){
message(" ReadMarker has terminated with errors.")
return(NULL)
}
## Has AA, AB, BB been assigned character values
if(is.null(AA) || is.null(BB))
{
message("Error: The function parameters AA and BB must be assigned a numeric or character value since a text file is being assumed. \n")
message(" Type help(ReadMarker) for help on how to read in the marker data. \n")
message(" ReadMarker has terminated with errors")
return(NULL)
}
## if there are no hets.
if(is.null(AB))
AB <- "NA" ## no hets
genofile <- fullpath( filename)
## Rcpp function to get dimensions of ASCII genotype file
message(" Getting number of individuals and snp from file ... ")
dim_of_ascii_M <- getRowColumn(fname=genofile)
## Rcpp function to create ascii M and Mt file from
message(" Beginning creation of reformatted file ... ")
it_worked <- create.ascii(file_genotype=genofile, type=type, AA=as.character(AA), AB=as.character(AB), BB=as.character(BB),
availmemGb=availmemGb, dim_of_ascii_M=dim_of_ascii_M, quiet=quiet, missing=missing )
if(!it_worked) ## error has occurred.
return(NULL)
if(.Platform$OS.type == "unix") {
asciifileM <- paste(tempdir() , "/", "M.ascii", sep="")
asciifileMt <- paste(tempdir() , "/", "Mt.ascii", sep="")
} else {
asciifileM <- paste(tempdir() , "\\", "M.ascii", sep="")
asciifileMt <- paste(tempdir() , "\\", "Mt.ascii", sep="")
}
} ## end if else nargs()==1 (PLINK case)
geno <- list("asciifileM"=asciifileM, "asciifileMt"=asciifileMt,
"dim_of_ascii_M" = dim_of_ascii_M)
if(.Platform$OS.type == "unix") {
RDatafile <- paste(tempdir() , "/", "M.RData", sep="")
} else {
RDatafile <- paste( tempdir() , "\\", "M.RData", sep="")
}
save(geno, file=RDatafile)
## create M.Rdata file in current directory
return(geno)
} ## end if else nargs()==0
} ## end function call ReadMarker
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.