R/qtl2 (aka qtl2) is a reimplementation of the QTL analysis software R/qtl, to better handle high-dimensional data and complex cross designs.
The input data file formats for R/qtl cannot handle complex crosses, and so for R/qtl2, we need to define a new input file format. This document describes the details.
For simple cross types, we can continue to use the file formats for R/qtl, use qtl::read.cross() to read in the data, and then use a conversion function (qtl2::convert2cross2()) to convert the data into the new format.
For more complex crosses, we need to define a new format. I was persuaded by Aaron Wolen's idea of a “tidy” format for R/qtl, with three separate CSV files, one for phenotypes, one for genotypes, and one for the genetic map.
Another important idea is from Pjotr Prins's qtab format: the inclusion of metadata, such as genotype encodings, with the primary data. This will simplify the handling of multiple files and will help to avoid mistakes.
And so the basic idea for the new format is to have a separate file for each part of the primary data (genotypes, founder genotypes, genetic map, physical map, phenotypes, covariates, and phenotype covariates), and then a control file which specifies the names of all of those files, the genotype encodings and missing value codes, and things like the name of the sex column within the covariate data (and the encodings for the sexes) and which chromosome is the X chromosome.
Before discussing the boring file specifications, let's consider briefly how the data are read into R.
A key advantage of the control file scheme is that it greatly simplifies the function for reading in the data. That function, read_cross2(), has a single argument: the name (with path) of the control file. So you can read in data like this:
library(qtl2)
grav2 <- read_cross2("~/my_data/grav2.yaml")
The large number of files is a bit cumbersome, so we've made it possible to use a [zip file](http://en.wikipedia.org/wiki/Zip_(file_format)) containing all of the data files, and to read that zip file directly. There's even a function for creating the zip file:
zip_datafiles("~/my_data/grav2.yaml")
The zip_datafiles() function will read the control file to identify all of the relevant data files and then zip them up into a file with the same name and location, but with the extension .zip rather than .yaml.
To read the data back in, we use the same read_cross2() function, providing the name (and path) of the zip file rather than the control file.
grav2 <- read_cross2("~/my_data/grav2.zip")
This can even be done with remote files.
grav2 <- read_cross2("http://kbroman.org/qtl2/assets/sampledata/grav2/grav2.zip")
Of course, the other advantage of the zip file is that it is compressed and so smaller than the combined set of CSV files.
The bulk of the data is in a set of comma-delimited (CSV) files. In addition, a control file (in YAML format), contained in the same directory as the CSV files, specifies the file names and other control parameters (such as genotype and sex encodings). Sample data files are available at the R/qtl2 website. We'll discuss the CSV files first.
The comma-delimited (CSV) files are each in the form of a simple matrix, with the first column being a set of IDs and the first row being a set of variable names.
Missing value codes will be specified in the control file (as na.strings) and will apply across all files, so a missing value code for one file cannot be an allowed value in another file.
The genotype data file is a matrix of lines × markers. The first column is the line IDs; the first row is the marker names. The founder genotypes (if needed) are in the same form, with founder lines as rows and markers as columns, and with founder IDs in the first column.
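As a rough illustration of this layout (and not the parser that R/qtl2 itself uses), the following base-R sketch reads a tiny genotype table, treating the first column as line IDs and the codes “-” and “NA” as missing. The line IDs, marker names, and genotype calls are all invented for the example.

```r
# Toy genotype data: lines as rows, markers as columns (all values invented)
csv_text <- "id,marker1,marker2,marker3
line1,SS,SB,-
line2,BB,NA,SB
line3,SB,SS,BB"

# The first column becomes the row names; "-" and "NA" are read as missing
geno <- read.csv(text = csv_text, row.names = 1,
                 na.strings = c("-", "NA"))

dim(geno)          # lines x markers
sum(is.na(geno))   # number of missing genotype calls
```

Because the missing value codes apply to every file, a code like “-” should not also be used as a real value in, say, the covariate file.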
We split the numeric phenotypes from the mixed-mode covariates, as two separate CSV files. Each file forms a matrix of individuals × phenotypes (or covariates), with the first column being individual IDs and the first row being phenotype or covariate names. Sex and line IDs (if needed) can be columns in the covariate data.
A separate CSV file contains phenotype covariate data, as phenotypes × phenotype covariates. The first column contains phenotype names, and the first row contains the names of the phenotype covariates.
Genetic and physical maps of the genotyped markers are given as separate CSV files, each with three columns: marker, chromosome, and position. The first row should be marker,chr,pos but will be ignored. In the genetic map file, positions should be in centiMorgans (cM). In the physical map file, positions should be in megabasepairs (Mbp).
The "cross_info" data specifies details of the cross that generated each line (or individual) and is a numeric matrix with lines as rows (the same number of rows as in the genotype data) and with columns depending on the cross type.
For simple cross types (e.g., "f2", an intercross between two inbred lines), this cross information may be included as a column in the covariate data. More generally, the cross information will be a separate CSV file. For example, for a set of Collaborative Cross (CC) lines, we will want a matrix with eight columns, which indicate the order of the founders in the crosses that generated each CC line.
So, in general, the cross information will be in a CSV file with lines as rows and a set of columns that define the cross information for that cross type. The first column contains line IDs and the first row contains column names. Details on the column information are provided in the cross-type-specific information, below.
The new input file format includes a text-based control file (in YAML format) to specify the names of all of the other files as well as various control parameters such as genotype and sex encodings and codes for missing values. We use YAML because it is flexible, readable, and easy to import into R.
The format of the control file is a bit technical. We describe the details here and also provide a function write_control_file() that takes the detailed specifications as input and constructs the control file in the correct format.
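As a sketch of how write_control_file() might be called to produce a control file for the sample intercross data, the code below passes the file names and encodings as arguments. The argument names are as given in the qtl2 help page for write_control_file(); check them against your installed version. The output location (a temporary directory) is only for illustration, and the call is skipped if qtl2 is not installed.

```r
# Where to write the control file (a temporary location, for illustration)
control_file <- file.path(tempdir(), "iron.yaml")

if (requireNamespace("qtl2", quietly = TRUE)) {
  qtl2::write_control_file(
    control_file,
    crosstype = "f2",
    geno_file = "iron_geno.csv",
    gmap_file = "iron_gmap.csv",
    pheno_file = "iron_pheno.csv",
    covar_file = "iron_covar.csv",
    phenocovar_file = "iron_phenocovar.csv",
    geno_codes = c(SS = 1L, SB = 2L, BB = 3L),      # genotype encodings
    alleles = c("S", "B"),
    sex_covar = "sex",                               # column in covariate file
    sex_codes = c(f = "female", m = "male"),
    crossinfo_covar = "cross_direction",
    crossinfo_codes = c("(SxB)x(SxB)" = 0L, "(BxS)x(BxS)" = 1L),
    xchr = "X",
    na.strings = c("-", "NA"),
    overwrite = TRUE                                 # replace an existing file
  )
}
```

The data files themselves would need to sit in the same directory as the resulting control file before it could be read with read_cross2().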
We'll start with an example: the control file for the sample intercross data.
# Data from Grant et al. (2006) Hepatology 44:174-185
# Abstract of paper at PubMed: http://www.ncbi.nlm.nih.gov/pubmed/16799992
# Available as part of R/qtl book package, https://github.com/kbroman/qtlbook
crosstype: f2
geno: iron_geno.csv
pheno: iron_pheno.csv
phenocovar: iron_phenocovar.csv
covar: iron_covar.csv
gmap: iron_gmap.csv
alleles:
- S
- B
genotypes:
  SS: 1
  SB: 2
  BB: 3
sex:
  covar: sex
  f: female
  m: male
cross_info:
  covar: cross_direction
  (SxB)x(SxB): 0
  (BxS)x(BxS): 1
x_chr: X
na.strings:
- '-'
- NA
Any line that begins with a “#” is treated as a comment and ignored. It's good to include some comments at the top of the file, describing the dataset.
The order of things within the file is not important, but the names of things are critical.
Much of the information is represented as key-value pairs, as “key: value.” For example, the cross type is indicated with a line like
crosstype: f2
The “key” is “crosstype” and the “value” is “f2.” This indicates that the data are for an F2 intercross between two inbred lines.
The names of the basic CSV files are indicated with lines like
geno: iron_geno.csv
This indicates that the genotype data are in the file iron_geno.csv.
The files are expected to be in the same directory as the control
file. They could be placed in separate directories, with the file names
being paths relative to the location of the control file, but this
is not recommended (or well tested).
The “keys” for the different files are the following:
geno: genotype_filename
founder_geno: founder_genotype_filename
pheno: phenotype_filename
covar: covariate_filename
phenocovar: phenotype_covariate_filename
gmap: genetic_map_filename
pmap: physical_map_filename
Most of these files are optional; if a particular file is not used, the corresponding key can be omitted from the control file.
If one of the chromosomes is to be treated as the X chromosome, there should be a line like
x_chr: X
This specifies the chromosome ID for the X chromosome (X in this case).
To add labels in summary tables and plots, provide a vector of single-character allele labels, with one for each founder line. For example,
alleles:
- S
- B
This list of items, each beginning with a hyphen and a space, is the YAML format for a vector. It is equivalent to the R code c("S", "B").
You could also write this line as
alleles: [S, B]
which is an alternative format for vectors in YAML.
The control file should contain a record with “genotypes:” that specifies the genotype encodings. Here's an example:
genotypes:
  SS: 1
  SB: 2
  BB: 3
For each possible genotype code, indent and provide a “key: value” pair, with the key being the code used in the genotype and founder genotype files, and the value being an integer to which the genotype should be converted.
The above example would be suitable for a backcross or intercross. For a backcross, the second homozygote (BB in this case) is only needed in the case that there are X chromosome genotypes for males.
For RIL, we would use something like
genotypes:
  BB: 1
  DD: 2
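The genotype encodings amount to a lookup table from the codes in the data files to integers. The following base-R sketch illustrates the idea with a named vector; it is only meant to show the mapping and is not the conversion code used inside R/qtl2. The genotype calls are invented.

```r
# Encodings as they might appear under "genotypes:" in the control file
geno_codes <- c(SS = 1L, SB = 2L, BB = 3L)

# Genotype calls for one marker; NA marks a call already read as missing
calls <- c("SS", "SB", NA, "BB", "SS")

# Look each call up by name; missing calls stay NA
encoded <- unname(geno_codes[calls])
encoded
```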
For crosses with multiple parents, the genotype file should contain
genotype calls for a set of SNPs, and there should be a corresponding
founder genotype file with genotypes of the founders at those SNPs.
A common set of genotype codes needs to be used for all SNPs.
In particular, the genotypes cannot be encoded as AA, CC, GG, TT, AC, AG, because then, e.g., CC would need to be treated as 1 for some SNPs and 3 for others. Instead, code the genotypes with something like AA, AB, BB, and then include the following in the control file:
genotypes:
  AA: 1
  AB: 2
  BB: 3
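If your SNP calls are currently nucleotide pairs, they need to be recoded to a common set of codes before use. Here is a minimal sketch of one way to do that for a single SNP, assuming you know its two alleles. The helper recode_snp() is hypothetical (it is not part of R/qtl2), and the choice of which allele is called “A” is arbitrary but should be applied consistently to the founder genotypes and the cross genotypes at that SNP.

```r
# Hypothetical helper: recode nucleotide calls at one SNP to AA/AB/BB
recode_snp <- function(calls, allele_a, allele_b) {
  hom_a <- paste0(allele_a, allele_a)
  hom_b <- paste0(allele_b, allele_b)
  het <- c(paste0(allele_a, allele_b), paste0(allele_b, allele_a))

  out <- rep(NA_character_, length(calls))   # anything unrecognized stays NA
  out[calls %in% hom_a] <- "AA"
  out[calls %in% het] <- "AB"
  out[calls %in% hom_b] <- "BB"
  out
}

# A C/T SNP with C taken as the "A" allele
recode_snp(c("CC", "CT", "TC", "TT", NA), allele_a = "C", allele_b = "T")
```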
Sex can be provided as a column in the covariate file or as a separate file.
If it is a column in the covariate file, the control file should have a section that looks like this:
sex:
  covar: sex
  f: female
  m: male
Here, “covar: sex” indicates that the column name used in the covariate file is “sex.” If the column name were “Sex,” you would write “covar: Sex.”
The other two “key: value” pairs are the encodings used for sex, with the “keys” being the codes used in the covariate file and the “values” being female and male. So this indicates that sex was encoded as f for females and m for males. If, instead, the sex covariate had 0 for females and 1 for males, you would use:
sex:
  covar: sex
  0: female
  1: male
Sex information can also be provided as a separate file. In this case, the file should have two columns: individual ID, and sex. Further, the part of the control file dealing with sex should look like this:
sex:
  file: sex_filename
  f: female
  m: male
So instead of a line with “covar:,” use “file:” followed by the name of the file (e.g., “file: iron_sex.csv”). You must still provide the sex encodings, as before.
For simple crosses (e.g., an intercross), cross information can be a single column within the covariate file. In this case, include something like the following in the control file:
cross_info:
  covar: cross_direction
  (SxB)x(SxB): 0
  (BxS)x(BxS): 1
This is much like the information for sex. The “covar:” line indicates the name of the column in the covariate data that corresponds to the cross information. The other two lines indicate the encodings of the cross information as “key: value” pairs, where “key” is the code used in the cross information column and “value” is the integer to which it should be converted.
More generally, the cross information would be contained in a separate comma-delimited file. For simple crosses, in which the cross information is a single column, we allow it to be encoded differently from what is needed, and the control file information should look like this:
cross_info:
  file: crossinfo_filename
  (SxB)x(SxB): 0
  (BxS)x(BxS): 1
For more complex crosses (e.g., the Collaborative Cross), the cross information spans multiple columns and we require that the user have set this up in advance (i.e., no translation of encodings will be performed). In this case the relevant section of the control file looks like this:
cross_info:
  file: crossinfo_filename
Or, more simply, you could write:
cross_info: crossinfo_filename
For crosses with multiple phenotyped individuals for each genotyped line, we need a mapping of individuals to lines ("linemap"). This can be a single column in the covariate file, or it can be a separate file.
If the individual-to-line mapping is provided as a column in the covariate data, the control file information should look like this:
linemap:
  covar: linemap_column_name
Or, more simply, write:
linemap: linemap_column_name
If, instead, the mapping is provided as a separate file, write:
linemap:
  file: linemap_filename
Or, more simply, write:
linemap: linemap_filename
If a construction like “linemap: value” is used, we look to see if “value” corresponds to the name of a file; otherwise, we treat it as a column name in the covariate data. But the use of “covar:” or “file:” is more explicit and so may be preferred.
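The rule for the short form can be pictured with the following sketch. The helper resolve_linemap() is hypothetical and only mirrors the description above (check for a file first, otherwise fall back to a covariate column); we assume the file name is taken relative to the directory containing the control file, and the actual R/qtl2 code may differ in its details.

```r
# Hypothetical helper: decide how to interpret "linemap: value"
resolve_linemap <- function(value, control_dir) {
  if (file.exists(file.path(control_dir, value))) "file" else "covar"
}

# Demonstration in a temporary directory holding one (toy) linemap file
control_dir <- tempfile("qtl2_demo_")
dir.create(control_dir)
writeLines("id,line", file.path(control_dir, "linemap.csv"))

resolve_linemap("linemap.csv", control_dir)   # matches an existing file
resolve_linemap("line_id", control_dir)       # no such file: column name
```

This also shows why the explicit forms are safer: a covariate column that happens to share its name with a file in the directory would be interpreted as a file.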
To indicate the set of codes that are to be treated as missing values in the genotype, founder genotype, phenotype, covariate, and phenotype covariate files, define na.strings within the control file:
na.strings:
- NA
- '-'
A hyphen needs to be surrounded by single or double quotes. Many other character strings (such as NA) do not. This is a similar construction to that for the allele codes above; the list with hyphens followed by a space is the YAML format for a vector. You could also write:
na.strings: [NA, '-']
which is another way to define a vector with YAML.
If the data files use a separator other than a comma (e.g., a semi-colon, or the vertical bar (|) which I like because it is seldom present in data), indicate the separator within the control file, as follows:
sep: '|'
A vertical bar needs to be surrounded by single or double quotes. A semicolon doesn't, but quoting it doesn't hurt.
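To see what such a file looks like from R's side, here is a small base-R sketch that reads vertical-bar-delimited covariate data, again with “-” and “NA” as missing value codes. The data are invented, and read.csv() is used only for illustration; it is not necessarily how R/qtl2 reads the files.

```r
# Toy covariate data delimited by "|" (all values invented)
bar_text <- "id|weight|sex
ind1|23.1|f
ind2|-|m"

# Same idea as a CSV file, but with sep set to the vertical bar
covar <- read.csv(text = bar_text, sep = "|", na.strings = c("-", "NA"))

covar$weight   # the "-" entry is read as a missing value
```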