read.data | R Documentation |
Load data stored in the format of a structure file or similar data table.
read.data(
  file,
  mainparams = NULL,
  extracol.names = NULL,
  precol.headers = 1,
  nprecol,
  markername.dup = 0,
  NUMLOCI.autoAccept = TRUE,
  EXTRACOL = 0,
  INDLABEL = 1,
  LOCDATA = 0,
  MAPDISTANCE = 0,
  MARKERNAME = 1,
  MISSINGVAL,
  NUMINDS = 0,
  NUMLOCI = 0,
  ONEROW = 0,
  PHASED = 0,
  PHENOTYPE = 0,
  PLOIDY = 2,
  POPFLAG = 0,
  POPID = 1,
  RECESSIVEALLELE = 0,
  marker.info.file = NULL,
  sourceAbsent = FALSE
)
file |
Character string. The name of the data file to read in. |
mainparams |
Character string. The name of an associated ‘mainparams’
file to read in. Optional as |
extracol.names |
Character string or vector. Names of extra (i.e. none of those specifically named here)
non-marker columns in the data file. Optional; used
when the data file does not contain pre-marker EXTRACOL headers and you
wish to add them on input. Default is |
precol.headers |
Numeric. Presence ( |
nprecol |
Numeric. Number of pre-marker columns in the data file. No default and must always be entered. Set to 2 for standard ‘plink’ structure files. |
markername.dup |
Numeric. Whether each marker name appears twice in
the data file header row. Default is |
NUMLOCI.autoAccept |
Logical. If the number of loci (NUMLOCI) is not entered, whether to require manual acceptance/rejection of
the internally calculated NUMLOCI. Default is |
EXTRACOL |
Numeric. Number of extra (i.e. none of those
specifically named here) non-marker columns in the data file. Not needed if this info is
uploaded in a ‘mainparams’ file. For compatibility with ‘structure’ |
INDLABEL |
Numeric. Presence ( |
LOCDATA |
Numeric. Presence ( |
MAPDISTANCE |
Numeric. Presence ( |
MARKERNAME |
Numeric. Presence ( |
MISSINGVAL |
The identifier for missing values. No default. Not needed if the info is uploaded in a ‘mainparams’ file. Currently,
|
NUMINDS |
Numeric. Number of individuals in the data set. Required to check that the marker data are of the appropriate dimensions. Default
set to |
NUMLOCI |
Numeric. Number of loci in the data set. Not needed if the info is uploaded in a ‘mainparams’ file. If the default value ( |
ONEROW |
Numeric. Whether ( |
PHASED |
Numeric. Presence ( |
PHENOTYPE |
Numeric. Presence ( |
PLOIDY |
Numeric. Maximum ploidy among the markers in the data file. Not needed if the info is uploaded in a ‘mainparams’ file.
Default is |
POPFLAG |
Numeric. |
POPID |
Numeric. Presence ( |
RECESSIVEALLELE |
Numeric. |
marker.info.file |
Character string. The name of an optional file of additional marker information to be read in, with one row per marker and any number of columns.
This can be joined to the primary output table (and to the locus table if |
sourceAbsent |
Logical. Whether source (parental reference) samples are absent from the dataset. If |
read.data is designed to be compatible with data input files for the software structure, including those produced by plink. However, ‘plink’ structure files use zero to denote missing data, which is not currently allowed in the data.table package, although this limitation is being addressed. For the time being, zeroes should be replaced prior to use of the read.data function, preferably with NA (no quotation marks needed).
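The zero-to-NA replacement described above could be done in base R before calling read.data. The following is only a minimal sketch: recode_zero_missing is a hypothetical helper, not part of the package, and it assumes a whitespace-delimited file with n_header header rows and nprecol pre-marker columns, in which a marker field of exactly "0" always means missing data.

```r
# Hypothetical helper (not part of the package): recode "0" as "NA" in the
# marker columns of a whitespace-delimited structure-style file.
# Assumes `n_header` header rows and `nprecol` pre-marker columns.
recode_zero_missing <- function(infile, outfile, nprecol = 2, n_header = 1) {
  lines <- readLines(infile)
  is_header <- seq_along(lines) <= n_header
  header <- lines[is_header]
  body <- lines[!is_header]
  recoded <- vapply(strsplit(trimws(body), "[[:space:]]+"), function(fields) {
    # Only fields after the pre-marker columns are treated as allele codes
    marker_idx <- seq_along(fields) > nprecol
    fields[marker_idx & fields == "0"] <- "NA"
    paste(fields, collapse = " ")
  }, character(1))
  writeLines(c(header, recoded), outfile)
  invisible(outfile)
}

# Small demonstration on temporary files
infile <- tempfile(fileext = ".data")
outfile <- tempfile(fileext = ".data")
writeLines(c("chr1:001 chr1:002", "ind1 pop1 A A 0 B", "ind2 0 B B 0 0"), infile)
recode_zero_missing(infile, outfile, nprecol = 2, n_header = 1)
result <- readLines(outfile)
```

Header rows and pre-marker columns are left untouched, so a "0" used as a POPID value, for instance, is not recoded. If the file has additional header rows (such as a MAPDISTANCE row), n_header would need to be increased accordingly.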
The simplest file format to read in is a rectangular data table with a complete header row, one column per marker and one row per allele copy (equivalent to ONEROW = 0). With this format the only required fields other than those with default settings are file, nprecol, NUMINDS and MISSINGVAL. Furthermore, allele identities are read in as character strings and therefore each allele can be given any name or number, as long as there are only two unique alleles per locus.
As with structure, all non-marker
columns should be to the left of the marker columns in the data file.
The data input file is required to have at least one column prior to the
marker data columns. Typically there are two columns. The first (required) is referred to as ‘INDLABEL’ (but can be
given a different column header) and represents the finest resolution of
identification that may be required for the data set (usually a unique
individual reference, but could for example be a population-level
reference for poolseq data). The second is referred to as ‘POPID’ and
indicates the population from which the individual was sampled. The main
purpose of the ‘POPID’ column is to declare, downstream, which POPID values
represent the parental reference samples, and it can also be useful for
plotting hybrid index estimates. If the ‘POPID’ column is absent, or if it is desired to use another column to
identify parental reference samples, an alternative column must be identified downstream in the data.prep function.
Any ploidy is allowed, but the declared PLOIDY must be the same for all markers. For example, if the data are haplodiploid, haploid markers should either be coded as diploid homozygotes or have the second allele declared as missing data.
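The two ways of coding haploid markers under a declared PLOIDY of 2 can be illustrated in base R. This is a sketch only: expand_haploid is a hypothetical helper, not part of the package, and it assumes one haploid allele call per individual at a single marker, to be laid out as two consecutive allele-copy rows (as in ONEROW = 0 format).

```r
# Hypothetical helper (not part of the package): expand one haploid allele call
# per individual into two allele copies, for use with a declared PLOIDY of 2.
expand_haploid <- function(alleles, method = c("homozygote", "missing")) {
  method <- match.arg(method)
  # Second copy is either a duplicate of the first or missing data
  second <- if (method == "homozygote") alleles else rep(NA_character_, length(alleles))
  # Interleave the copies so each individual occupies two consecutive entries
  as.vector(rbind(alleles, second))
}

haploid_calls <- c(ind1 = "A", ind2 = "B")
as_homozygotes <- expand_haploid(haploid_calls, "homozygote")  # "A" "A" "B" "B"
as_missing <- expand_haploid(haploid_calls, "missing")         # "A" NA  "B" NA
```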
If an associated ‘mainparams’ file is read in, it should contain two columns and no header row, with the first column holding the field names and the second the field values. The only mandatory fields are the following, each of which must be named using one of the synonyms shown: PLOIDY, MISSINGVAL/MISSING, ONEROW/ONEROWPERIND, INDLABEL, POPID, EXTRACOL/EXTRACOLS, MARKERNAME/MARKERNAMES. If RECESSIVEALLELE or MAPDISTANCE are present, they can be pluralized.
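A file with the layout described above could be written from R as follows. This is a sketch under assumptions: the whitespace delimiter, the file name ex_mainparams.txt and the field values shown are illustrative choices, not requirements stated in this documentation.

```r
# Field names in column 1, field values in column 2, no header row.
# The whitespace delimiter and the values are assumptions for illustration.
mainparams_lines <- c(
  "PLOIDY 2",
  "MISSINGVAL NA",
  "ONEROW 0",
  "INDLABEL 1",
  "POPID 1",
  "EXTRACOL 0",
  "MARKERNAME 1"
)
mainparams_file <- file.path(tempdir(), "ex_mainparams.txt")
writeLines(mainparams_lines, mainparams_file)

# Read the file back with base R to confirm the two-column layout
check <- read.table(mainparams_file, header = FALSE, colClasses = "character",
                    col.names = c("field", "value"))
```

The resulting file name would then be passed to read.data through the mainparams argument.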
read.data uses the fread function from data.table to rapidly read in the data file. The loaded data file is therefore of class data.table and data.frame. A data.table can be treated the same as a standard data.frame for those not familiar with the package.
read.data returns a list with one component containing the loaded data in ONEROW = 0 format, in the form of a data.table and data.frame, and other components used in downstream functions or otherwise potentially useful to the user.
The list contains the following components:
mainparams |
A |
nprecols |
Numeric. The number of non-marker columns to the left of the marker columns in the imported data set. |
precols |
Character vector. The names of non-marker columns to the left of the marker columns in the imported data set. |
data |
A |
alleles |
A |
loci |
Character vector. The names of all loci in the imported data set. |
marker.info |
A |
Richard Ian Bailey richardianbailey@gmail.com
## Not run:
#First create example data files and save to the working directory#
#1. Genomic (SNP) data as a regular table with full header row#
ex <- "INDLABEL,POPID,chr1:001,chr1:002\nind1,pop1,A,A\nind1,pop1,A,B\nind2,pop1,B,B\nind2,pop1,B,B\nind3,pop2,NA,A\nind3,pop2,B,A\n"
#Save the same table (space-delimited) to the working directory as 'ex.data', to be read in by read.data. Open in e.g. notepad or notepad++ to view in table format#
cat("INDLABEL POPID chr1:001 chr1:002","ind1 pop1 A A","ind1 pop1 A B","ind2 pop1 B B","ind2 pop1 B B","ind3 pop2 NA A","ind3 pop2 B A", file = "ex.data", sep = "\n")
#2. Now create and save a file, ex2.data, in plink structure format (but with missing data recoded from 0 to NA), with a MAPDISTANCE row present below the marker names#
cat("chr1:001 chr1:002","-3 3 2 4","ind1 pop1 A A A B","ind2 pop1 B B B B","ind3 pop2 NA B A A", file = "ex2.data", sep = "\n")
#Load ex.data, in this case without specifying the number of loci (useful if this is not known in advance). These are the minimum arguments required for a file in this format#
dat <- read.data(file="ex.data",nprecol=2,NUMINDS=3,MISSINGVAL=NA)#NUMLOCI is not specified so will be calculated, with a warning#
#The same but specifying the number of loci. The resulting object is identical#
dat <- read.data(file="ex.data",nprecol=2,NUMINDS=3,NUMLOCI=2,MISSINGVAL=NA)#No warning this time as NUMLOCI is specified#
#The minimum arguments required to read in the PLINK structure format file with missing data recoded as 'NA' (gives a warning because NUMLOCI is not specified)#
dat2 <- read.data(file="ex2.data",MISSINGVAL=NA,NUMINDS=3,nprecol=2,ONEROW=1,MAPDISTANCE=1,precol.headers=0)
#Same, but displaying all arguments including those where the default works for this file format#
dat2 <- read.data(
file="ex2.data",
mainparams = NULL, #Default#
extracol.names = NULL, #Default#
precol.headers = 0,
nprecol=2,
markername.dup = 0, #Default#
NUMLOCI.autoAccept = TRUE, #Default#
EXTRACOL = 0, #Default#
INDLABEL = 1, #Default#
LOCDATA = 0, #Default#
MAPDISTANCE = 1,
MARKERNAME = 1, #Default#
MISSINGVAL = NA, #Cannot be zero for the time being#
NUMINDS = 3,
NUMLOCI = 2, #Only mandatory when PHASED=1; otherwise will be calculated internally if not known#
ONEROW = 1,
PHASED = 0, #Default#
PHENOTYPE = 0, #Default#
PLOIDY = 2, #Default#
POPFLAG = 0, #Default#
POPID = 1, #Default#
RECESSIVEALLELE = 0, #Default#
marker.info.file = NULL, #Default#
sourceAbsent = FALSE #Default#
)
#If NUMLOCI is not specified, the number of loci will be calculated with a warning, which will not interfere with downstream processes.
#However, if 'NUMLOCI.autoAccept = FALSE' is set, the user is required to manually accept the calculated number of loci. This option is
#included in case the user wishes to verify that the calculated number of loci is accurate. Stops with an error if the estimated NUMLOCI
#is manually rejected. Example:
dat2 <- read.data(file="ex2.data",NUMLOCI.autoAccept = FALSE,MISSINGVAL=NA,NUMINDS=3,nprecol=2,ONEROW=1,MAPDISTANCE=1,precol.headers=0)
#Uploading a marker info file.
#Example: Create a marker info file indicating whether the locus is intronic or exonic, and save to the working directory#
cat("locus type","chr1:001 intronic","chr1:002 exonic", file = "ex_marker_info.data", sep = "\n")
#Read it in alongside the SNP data#
dat <- read.data(file="ex.data",nprecol=2,NUMINDS=3,NUMLOCI=2,MISSINGVAL=NA,marker.info.file = "ex_marker_info.data")
#Loading a data file that does not include parental reference samples. For the situation where parental (S0 and S1) allele frequencies are
#to be loaded, but not parental reference samples. A marker.info.file can be loaded regardless of whether S0 and S1 samples are present in
#the dataset, but it is obligatory if they are absent. When 'sourceAbsent = TRUE', as a minimum the following columns with the exact headers
#in the first set of quotation marks below are required in the marker.info.file (more columns are allowed).
cat("locus refAllele alternateAllele S0.prop_r S1.prop_r type","chr1:001 A B 0.1 0.9 intronic","chr1:002 B A 0.8 0.2 exonic", file = "ex_marker_info2.data", sep = "\n")
#"...prop_r" means the allele frequency of the reference allele. Choice of reference and alternate alleles is arbitrary#
dat <- read.data(file="ex.data",nprecol=2,NUMINDS=3,NUMLOCI=2,MISSINGVAL=NA,marker.info.file = "ex_marker_info2.data", sourceAbsent = TRUE)
#The data.prep function will then determine which allele has higher frequency in S1, as it would if parental reference samples were included#
unlink("ex_marker_info.data")#Tidy up#
unlink("ex_marker_info2.data")#Tidy up#
unlink("ex.data")#Tidy up#
unlink("ex2.data")#Tidy up#
## End(Not run)