R/qtl2 (aka qtl2) is a reimplementation of the QTL analysis software R/qtl, to better handle high-dimensional data and complex cross designs.
In this document, we'll describe the technical details of interest to developers wishing to contribute to R/qtl2.
There are three basic aspects to the software:
QTLCross
. We're using
Rcpp to connect from R to C++.We are redefining the basic data structures to be more general (to handle more complex crosses, such as the Collaborative Cross, MAGIC lines, and Diversity Outcross), and to incorporate more information, such as annotations of the phenotypes (for example, with gene expression data, the gene identifiers, physical locations of genes, and the tissue that was measured).
We are also redefining the input data file formats.
The basic data structure in R/qtl, the
"cross"
class, has been redesigned.
A basic principle in the new design is to have the data as close as possible in the form in which it will be used, and to have a more-flat (i.e., less nested) data structure. So don't entangle the marker maps with the genotype data, and pull out the sex information rather than have to find it within the phenotype data whenever it is needed.
A key design decision concerns whether to split out the genotype data and marker maps by chromosome, or to keep them as a simpler table. I'm choosing to have these data split out by chromosome, as much of the analysis tends to proceed by chromosome, and it tends to be easier to combine then to split.
Note: I am fully open to modifying this design. I would be glad for suggestions.
The data remains a list, now of class "cross2"
. It has many
components (some of them optional) in unspecified order; the names of
the components are what matter.
For some cross types (e.g., recombinant inbred lines), we will
separate “lines” from “individuals.” Genotype
data will be on the lines, while phenotype data will be on the
individuals. In this case we will need a mapping from individuals to
lines, "linemap"
(more below). (For a cross type for which the
phenotyped individuals are also the phenotyped ones (e.g., the
Diversity Outcross), think “individual” wherever we write
“line.”)
"crosstype"
)Previously, the cross type (e.g., "bc"
for backcross and "f2"
for
intercross) was part of the "class"
. This was a bad idea.
Now, the data will have a component "crosstype"
that is a single
character string with the cross type:
"bc"
for backcross"f2"
for intercross"riself"
for 2-way RIL by selfing"risib"
for 2-way RIL by sib-mating"dh"
for doubled haploids (like a backcross, but with genotype
codes AA/BB)"haploid"
for haploids (like a backcross, but with genotype codes A/B)"ail"
for 2-way advanced intercross lines"mwriself"
for 2k-way RIL by selfing"mwrisib"
for 2k-way RIL by sib-mating"preriself"
for partially-inbred 2k-way RIL by selfing"prerisib"
for partially-inbred 2k-way RIL by sib-mating"do"
for diversity outcross"hs"
for 8-way heterogeneous stockMost of these have not yet been implemented.
"geno"
)The previous data structure had deep nesting of information; we're going
to split that out and make things more flat. The genotype data will
now be a list of matrices of integers. Each component of the
list is a chromosome; the names are the chromosome names. Each
chromosome has a matrix of n_lines
x n_markers
.
The column names are the marker names, and the row names are the line identifiers (hereafter “IDs”).
The genotypes are observed marker genotypes, represented as
integers. Missing values are coded as 0 (not NA
, as in R/qtl). For
simple crosses, the autosomal genotypes are coded as before.
1
/2
for (AA/BB or A/B)1
/2
for autosomes (for AA/AB)1
/2
/3
/4
/5
for autosomes (for AA/AB/BB/notBB/notAA)However, we're changing the encoding of X chromosome genotypes to
have males coded as if they were homozygous. In an intercross, the X
chromosome encodings are 1
/2
/3
for females (1
/2
for the forward
direction and 2
/3
for the reverse), and 1
/3
for males. In a backcross,
the females are 1
/2
and the males are 1
/3
.
For crosses with > 2 founders, I'm expecting SNPs, and intending,
initially, to assume that the markers are diallelic, encoded
1
/2
/3
(with 2
being the heterozygote).
All of the above is under the assumption that we're using genotypes
calls as the basic marker genotype information. Of course, we will
also want to handle genotype-by-sequencing (GBS) data (which might be
represented as a pair of allele counts), or array intensity
information (which would be represented as a pair of allele
intensities). It seems best to have separate data structures for these
cases, perhaps named geno_gbs
and geno_int
(int
for
“intensity”). These could be three-dimensional arrays
(n_lines
× n_markers
× 2
), with the third
dimension corresponding to the two alleles.
"founder_geno"
)For crosses with > 2 founders, we will have a separate set of
genotypes on the founders. This will again be a list of matrices, each
matrix being the data for a chromosome, of size n_founders
×
n_markers
. I expect these to be diallelic markers, such as SNPs. We
will encode them as 1
/2
/3
(allowing heterozygotes, though I'll
probably treat the hets as missing values).
"is_x_chr"
)A logical vector of length n_chromosomes
indicates which of the
chromosomes is the X chromosome.
"is_female"
)For the treatment of the X chromosome, we need access to sex of the lines.
(That sounds a bit strange; sex is really a feature of individuals and
won't actually be considered for cross types where we separate
individuals from lines.)
We'll have a logical vector ("is_female"
) indicating which
individuals are female. I prefer the logical vector as it's less
susceptible to confusion. (Is 0 female and 1 male, or the other way
around?) This will have length n_lines
, with the name
attribute being the individual IDs.
"cross_info"
)For many cross types, and particularly for the treatment of the X
chromosome, we need line-level information about the nature of the
cross. For an intercross, this is like pgm
(for paternal
grandmother) in R/qtl. For the Collaborative
Cross, we need the order of the founders in the set of crosses that
led to each line. For the AIL and the Diversity Outcross, we need to
know the number of generations of outbreeding.
This "cross_info"
component will be a matrix of integers with
n_lines
rows, and with the number of columns depending on the cross
type.
This information is highly cross-type-specific. We'll leave the details to the discussion of the format of input files.
"gmap"
)The genetic map of the markers is a list of numeric vectors; each vector corresponds to a chromosome and gives the locations of markers in centiMorgans (cM), with the names attribute being the marker names. The markers should be in increasing order.
"pmap"
)We will also allow (and perhaps expect) a physical map of the markers. This will have the same form as the genetic map (with the same chromosomes, the same markers, and with markers in the same order), but with positions in Mbp. (Or perhaps we should use vectors of integers, with positions in basepairs?)
"pheno"
)We will separate out the numeric phenotypes from messier covariates.
(In many cases, we want to perform QTL analysis on a large set of
phenotypes, and having other stuff, like individual IDs, mixed in
there can make things cumbersome). The phenotype data will be a
numeric matrix of size n_individuals
× n_phenotypes
.
Row names are the individual IDs and column names are the phenotype names.
"covar"
)Covariate information, often non-numeric, will form a separate data
frame, of size n_individuals
× n_covariates
. The columns can
be of mixed modes (numeric, factors, character strings, etc.).
Row names are the individual IDs and column names are the covariate names.
"phenocovar"
)We will have a separate data frame of “phenotype
covariates.” These are metadata describing the phenotypes.
The dimension is n_phenotypes
× n_phenocovar
.
For example, in the case of a phenotype measured over time, one column in the phenotype covariate data frame could be the time of measurement.
For gene expression data, we would have columns representing chromosome and physical position of genes, as well as gene IDs. In the case of gene expression on multiple tissues, there could be a column representing the tissue. Or we might have both gene expression and proteomic measurements, and so a column indicating expression vs protein.
"linemap"
)For recombinant inbred lines (including the Collaborative Cross and
MAGIC lines), we will want the genotypes at the line level and the
phenotypes at the individual level. In this case we need a mapping
between them. This will be a vector of character strings containing
the line IDs, with length n_individual
and the names attribute being
the individual IDs. We might make this a two-column matrix, but it's
more convenient as a vector with a names attribute, which acts like an
associative array
(also known as a dictionary, map, hash, or symbol table).
(We might consider having a separate object type containing just the
pheno
, phenocovar
, and linemap
objects. This may simplify the
use of multiple sets of phenotype data.)
"alleles"
)The last bit is a vector of single-character strings, with allele codes for the founders, to be used in various summaries and data visualizations.
"genoprob"
)A critical piece of derived data is the conditional QTL genotype
probabilities, given the observed marker data. This will be a list of
three-dimensional arrays; each array corresponds to a chromosome and is of dimension
n_lines
× n_positions
× n_genotypes
.
An important consideration is the encoding of the genotypes. For crosses with two founder lines, this is straightforward except perhaps for the X chromosome.
For the X chromosome, I'm using phase-known genotypes for the females
and separating the male hemizygous genotypes. For example, for a
backcross, the genotype codes are 1
/2
/3
/4
for AA/AB/AY/BY.
For an intercross, the genotype codes are 1
/2
/3
/4
/5
/6
for
AA/AB/BA/BB/AY/BY.
For crosses with > 2 founders and heterozygous offspring (such as the Diversity Outcross), genotypes will be encoded as integers in the following way, for the phase-unknown case:
encode_geno <- function(g1,g2, phase_known=FALSE, n_alleles=8) { m <- pmax(g1, g2) d <- abs(g1 - g2) result <- choose(m+1, 2) - d if(phase_known) result[g1 > g2] <- result[g1 > g2] + choose(n_alleles+1,2)-1 result } ng <- 8 z <- outer(1:ng, 1:ng, encode_geno) colnames(z) <- rownames(z) <- LETTERS[1:ng] z
If $a_1$ and $a_2$ are the two alleles, then we take $m = \max(a_1, a_2)$ and $d = |a_1 - a_2|$, and the encoding is $\binom{m+1}2 - d$.
In the phase-known case, we can add $\binom{n+1} 2 - 1$ to the lower triangle, where $n$ is the number of alleles, to give the following:
ng <- 8 z <- outer(1:ng, 1:ng, encode_geno, TRUE, ng) colnames(z) <- rownames(z) <- LETTERS[1:ng] z
For simple cross types, we can use the file formats for
R/qtl, use qtl::read.cross
to read in the
data, and then use a conversion function (convert2cross2
) to convert
the data into the new format.
For more complex crosses, we need to define a new format. I was persuaded by Aaron Wolen's idea of a “tidy” format for R/qtl, with three separate CSV files, one for phenotypes, one for genotypes, and one for the genetic map.
Another important idea is from Pjotr Prins's qtab format: the inclusion of metadata, such as genotype encodings, with the primary data. This will simplify the handling of multiple files and will help to avoid mistakes.
And so the basic idea for the new format is to have a separate file for each part of the primary data (genotypes, founder genotypes, genetic map, physical map, phenotypes, covariates, and phenotype covariates), and then a control file which specifies the names of all of those files, the genotype encodings and missing value codes, and things like the name of the sex column within the covariate data (and the encodings for the sexes) and which chromosome is the X chromosome.
A key advantage of the control file scheme is that it greatly
simplifies the function for reading in the data. That function,
read_cross2()
, has a single argument: the name (with path) of the
control file.
The large number of files is a bit cumbersome, so we've made it
possible to use a
[zip file](http://en.wikipedia.org/wiki/Zip_(file_format) containing
all of the data files, and to read that zip file directly (with the same
function, read_cross2()
). The function zip_datafiles()
can be used
to create the zip file.
I describe the details of the input files in a separate vignette. Here I'll give a brief sketch of the structure of the files.
The control file will be in YAML: a human-readable text file for representing relatively complex data. It's much like JSON, but much more readable.
Here's an example, for a sample intercross dataset.
# Data from Grant et al. (2006) Hepatology 44:174-185 # Abstract of paper at PubMed: http://www.ncbi.nlm.nih.gov/pubmed/16799992 # Available as part of R/qtl book package, https://github.com/kbroman/qtlbook crosstype: f2 geno: iron_geno.csv pheno: iron_pheno.csv phenocovar: iron_phenocovar.csv covar: iron_covar.csv gmap: iron_gmap.csv alleles: - S - B genotypes: SS: 1 SB: 2 BB: 3 sex: covar: sex f: female m: male cross_info: covar: cross_direction (SxB)x(SxB): 0 (BxS)x(BxS): 1 x_chr: X na.strings: - '-' - NA
The order of the information is not important, but the names of things are critical.
crosstype
indicates the cross type. geno
, pheno
, phenocovar
,
covar
, and gmap
indicate the names of the files for the different
major pieces of data, all expected to be within the same directory as
the YAML control file.
alleles
indicates the two single-character allele codes for the
founders. The initial dashes are just to indicate that the S
and B
form a vector.
genotypes
gives the genotype codes used in the genotype data
file. SS
, SB
, and BB
are the codes used, to be converted to 1
,
2
, and 3
, respectively. The key: value
structure is for an
associative array.
sex
contains information about the name of the covariate that
represents sex as well as the codes used: the sexes in the
iron_covar.csv
file are coded as f
and m
, and we want to
indicate which one is female
and which is male
.
The format of the control file is maybe a bit technical for some
users, so there's a function write_control_file()
that takes the
control parameters (including file names) as input and writes the YAML
file in the correct form.
Again, I don't want to get into too much detail here. All of the other files are in a simple CSV format. Each is a simple matrix with row names in the first column and column names in the first row.
Genotypes are as lines × markers, with the first column being line IDs and the first row being marker names. The founder genotypes are similar, but with founder lines as the rows.
The phenotype and covariate data are as individuals × variables. The phenotype file must be strictly numeric, while the covariate file can be a mixture of types. The first column in each must be the individual IDs.
The phenotype covariate information is a matrix of phenotypes × phenotype covariates. The first column contains the phenotype names and the first row contains the names of the phenotype covariates.
The genetic and physical maps are in separate files with three columns: marker names, chromosome IDs, and positions.
The individual-to-line mapping ("linemap"
) would most likely be
a column in the covariate data and would be represented in the YAML
file much as sex
is above.
The last piece is "cross_info"
. For an intercross, this can just
be a column in the covariate data (as it is for the example
above). For more complex crosses, it will be a matrix with the rows
being lines, and with the first column being line IDs.
For more detail, see the input file format vignette.
est_map
, we potentially need a separate phase-known class
(explain why, and how this is done)Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.