This is a tutorial describing how to use ptlmapper. The tutorial is based on the analysis of a simulated dataset.
The dataset was generated from a simulator. The implemented model described the expression of a single gene with positive autoregulation. Protein production followed the equation: [P = \frac{\alpha_0 + \alpha_1 (\frac{P}{K})^h}{1+(\frac{P}{K})^h}]
We considered two different genotypic groups, that were characterized by the following sets of parameters:
[\left\lbrace \begin{array}{l} \alpha_0 = 0.1 \ \alpha_1 = 10 \ K = 1.6 \end{array} \right.]
[\left\lbrace \begin{array}{l} \alpha_0 = 6.3\ \alpha_1 = 12\ K = 10 \end{array} \right.]
We assumed that the set of parameters was controlled by one locus that could exist in two different allelic forms (A and B).
For each individual, we generated 20,000 cells according a given set of parameter values. In addition, we added a universal noise term $n$ to introduce intra-genotype inter-individual varation. This correspond to variations of the level of errors on the parameter values.
We considered a panel of individuals and their genotype at 200 markers evenly spaced every 5cM.
For simplicity, individuals were considered as haploid segregants derived from a cross between two haploid parents of genotype A and B, respectively.
The dataset is available as an R package called ptldata
(https://github.com/fchuffar/ptldata).
The ptldata
package contains the following objects:
library(ptldata) indiv = ptldata::indiv genodata = ptldata::genodata cells = ptldata::cells
which are described below.
indiv
displays relevant information on all individuals. It is a dataframe composed of 1490 rows (individuals) and 8 variables (descriptors):
err
, the value of the error introducted in the paramater selection level. This corresponds to the universal noise term $n$ mentioned above.all
, the discriminant parental allele origin (1 = genotype A or 2 = genotype B or 3 = undetermined)a0
, the a0 parameter value.a1
, rhe a1 parameter value.k
, the k parameter value.m1
, the first moment of the single cell distribution produced by the simulation.m2
, the second moment of the single cell distribution produced by the simulation.m3
, the third moment of the single cell distribution produced by the simulation.Row names are the names of individuals.
In this tutorial, we only focus on individuals that have been generated with a level of error of 0.2 and, for computational reasons we downsample the data to 80 individuals.
err = 0.2 nb_indiv = 80 idx = which(indiv$err==err)[1:nb_indiv] cells = cells[idx] indiv = indiv[idx,] genodata = genodata[,idx] head(indiv)
The following figure shows the distribution of the first, second and third moment of the single cells data associated to each individual (grey bars). We also color the histograms according to the genotype at the locus of interest (allele A in red and allele B in blue). This corresponds to the discriminant parental allele origin, displayed in the the 'all' variable of the 'indiv' dataframe.
We observe that the distributions of the third moment differ according to genotypes.
layout(matrix(1:3, 1), respect=TRUE) for (m in c("m1","m2","m3")){ h = hist(indiv[[m]], nclass=20, main=m, xlab="", ylab="", col="grey") for (all in unique(indiv$all)) { h = hist(indiv[indiv$all==all,][[m]], plot=FALSE, breaks=h$breaks) lines(h, col=adjustcolor(all*2, alpha=0.3)) } }
genodata
specifies the genotype of the simulated individuals at 201 markers. It is
a data frame with 201 rows (the markers) and 1490 columns (individuals). Marker 101 corresponds
to the genotypes matching grous A and B and therefore represents a locus controlling the third moment of the distributions.
Genotypes at marker 101 is fully correlated with the value of the 'all' variable of the 'indiv' data
frame. From this marker 101, genotypes at other markers were generated by propagating (upstream and downstream) the parental genotypes to consecutive markers
with a recombination factor of 5 (corresponding to 5cM).
Names of columns and rows correspond to the names of individuals and markers, respectively.
genodata[96:106, 1:7]
The following plot represents the distribution of the recombination factor (teta) between each pair of consecutive markers. As expected, it is centered on 0.05 (i.e. 5cM).
tetas = compute_teta(genodata) plot(density(tetas), main=paste("Recombination Fraction Distribution"), xlab=paste("mean(teta)=", signif(mean(tetas),3), sep="")) abline(v=mean(tetas), lty=2)
The cells
object emdebs simulated single cell trait values obtained for each individual.
It is a list of 1490 items (one per individual), each item being a vector of 20000 integers
which correspond to the quantitative trait values of 20000 cells from that individual.
The cells
list is named with the corresponding individual names.
head(names(cells))
This plot shows the distribution of single cell values for each individual colored according the discriminant parental allele origin.
plot(0,0, col=0, xlim=c(-1.5,27), ylim=c(0,0.3), main="distribution of phenotypes", xlab="single cell value", ylab="density") foo = sapply(rownames(indiv), function(i) { lines(density(cells[[i]], bw=1), col=indiv[i,"all"]*2) })
The purpose of this tutorial is to present a general method allowing to map single-cell Probabilistic Trait Loci (scPTL).
Functions presented here are freely available in the R package ptlmapper
.
library(ptlmapper)
This tutorial is composed of the following steps:
For each individual, we construct one histogram that is based on the list of single-cell values.
We obtain a list of histograms (one per individual), that all share the same breaks
values.
Binning can be modified using the bin_width argument.
pheno_hists = build_pheno_hists(cells, bin_width=1)
From the list of histograms, we build the Kantorovitch distance matrix.
Since Kantorovitch distance between two individuals is computed using their histograms, it is important that the breaks
values remains identical between individuals.
This matrix is a symmetric distance matrix where columns and rows are named after the corresponding individuals.
kd_matrix = build_kd_matrix(pheno_hists) kd_matrix[1:6,1:6]
From the list of single cell values stored in the histograms we build the multivariate moments matrix.
This matrix is composed of 1 line per individual (named after the individual) and nb_moment
columns named from moment_1
to moment_nb
.
nb_moment is highest-order moment you want to compute.
mm_matrix = build_mmoments_matrix(pheno_hists, nb_moment=4) head(mm_matrix)
The methods needs an annotated genome to work properly. Here, we add four colums to the genodata dataframe:
prob_name
, the probe name associated to the marker.chromosome
, the chromosome on which the marker is.position
, the position of the marker on the chromosome.rec_fractions
, the recombination factor associated to the marker.In genodata
, "2" stands for undetermined and is replaced by NA (missing data).
bckg = names(cells) genodata$chromosome = 1 genodata$position = 1:nrow(genodata) genodata$prob_name = rownames(genodata) genodata$rec_fractions = 0.05 genodata_ptl = preprocess_genodata(genodata, bckg)
The plt_mapping
function scans the genome to detect scPTLs using one of the two methods: kanto
or mmoments
.
Please refer to publication Chuffart et al. 2016 for more details.
The common algorithm is the following:
Phenotypes of individuals are described by a vector (PCA on mm_matrix or MDS on kd_matrix).
The algorithm determines with the Kaiser criterion which dimensions of this orthogonal transformation are relevant for canonical analysis (eig_to_scan
)
If nb_dim=0
, a z-score criterion is used to determine how many dimensions the canonical analysis should use to give the best result. The main purpose is to fixe nb_dim
in 2:max(eig_to_scan)
.
If nb_dim=NULL
, nb_dim
is automatically set to max(eig_to_scan)
.
If nb_dim>0
the canonical analysis is performed using the provided nb_dim
. In this two last cases, keep in mind that a high value of nb_dim
introduces a high degree of freedom in your canonical analysis. The number of dimensions nb_dim
needs to be substantially smaller than the size of your population.
kanto_analysis = ptl_scan(kd_matrix, genodata_ptl, method="kanto") mmoments_analysis = ptl_scan(mm_matrix, genodata_ptl, method="mmoments")
The ptlmapper
package embed features of the widely-used r/qtl
package [Broman et al. (2003)]. It allows to perform classic QTL mapping on univariate criteria (ex: mean, variance) using ptlmapper
data structures.
rqtl_analysis = rqtl_launch(genodata_ptl, pheno_hists, kanto_analysis, mmoments_analysis)
All these steps are embeded in the ptl_mapping
function.
This function is a workflow that aggregates parameter values, inputs, and outputs of the call, to the R package ptlmapper
function.
The resulting data structure could be cache (i.e. stored in memory) on the file system (using ptl_mapping_filename
argument of the function).
It is particularly useful to massivelly save multiple calls to plt_scan
and rqtl_launch
function, for example in a complex distributed design adressing many single phenotypes on a given population.
ptl_mapping_result = ptl_mapping(genodata, cells, bckg, nb_perm=20, bin_width=1)
These results concern univariate phenotype analysis. This figure displays genome scans for mean, var and noise QTL mapping.
layout(matrix(1:3, 1, byrow=TRUE), respect=TRUE) plot_rqtl(ptl_mapping_result, which_pheno=1) plot_rqtl(ptl_mapping_result, which_pheno=2) plot_rqtl(ptl_mapping_result, which_pheno=3)
These results concern multivariate approach with the two methods "kanto" and "mmoments".
The function get_best_markers_rptl
returns the details of marker that obtains the best scPTL score according to the method
used.
best_marker_kanto = get_best_markers_rptl(ptl_mapping_result, method="kanto") print(best_marker_kanto)
The first figure displays individuals (colored by their genotype at the marker having highest linkage score) in the two first dimensions of the phenotypic space (this space was produced by multidimensional scaling). The second figure shows scPTL scan of the genome obtained by using the "kanto" method. The third figure displays the phenotypic discrimination produced by the marker that obtains the best PTL score by using the "kanto" method.
layout(matrix(1:3, 1, byrow=TRUE), respect=TRUE) best_marker_name = rownames(best_marker_kanto)[1] col = marker2col(ptl_mapping_result, best_marker_name) plot_orth_trans(ptl_mapping_result, col=col, method="kanto") plot_wilks(ptl_mapping_result, method="kanto", ylim=c(0,10)) plot_can(ptl_mapping_result, best_marker_name, col=col, method="kanto")
To extract the results obtained by this method, we type:
best_marker_mmoments = get_best_markers_rptl(ptl_mapping_result, method="mmoments") print(best_marker_mmoments)
The first figure displays individuals (colored by their genotype at the marker having highest linkage score) along the two first components of the PCA. The second figure shows scPTL mapping of the genome obtained by using the "mmoments" method. The third figure displays the phenotypic discrimination produced by the marker that obtains the best PTL score by using the "mmoments" method.
layout(matrix(1:3, 1, byrow=TRUE), respect=TRUE) best_marker_name = rownames(best_marker_mmoments)[1] col = marker2col(ptl_mapping_result, best_marker_name) plot_orth_trans(ptl_mapping_result, col=col, method="mmoments") plot_wilks(ptl_mapping_result, method="mmoments", ylim=c(0,10)) plot_can(ptl_mapping_result, best_marker_name, col=col, method="mmoments")
We are grateful to the Pole Scientifique de Modelisation Numerique (Lyon, France) and the Grid'5000 experimental testbed (www.grid5000.fr) for computer resource, and the Centre Blaise Pascal of ENS de Lyon for housing the code on a forge. This work was supported by the European Research Council under the European Union's Seventh Framework Programme FP7/2007-2013 [grant SiGHT number 281359].
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.