Introduction

Dataset

This is a tutorial describing how to use ptlmapper. The tutorial is based on the analysis of a simulated dataset.

The dataset was generated from a simulator. The implemented model described the expression of a single gene with positive autoregulation. Protein production followed the equation: [P = \frac{\alpha_0 + \alpha_1 (\frac{P}{K})^h}{1+(\frac{P}{K})^h}]

We considered two different genotypic groups, that were characterized by the following sets of parameters:

[\left\lbrace \begin{array}{l} \alpha_0 = 0.1 \ \alpha_1 = 10 \ K = 1.6 \end{array} \right.]

[\left\lbrace \begin{array}{l} \alpha_0 = 6.3\ \alpha_1 = 12\ K = 10 \end{array} \right.]

We assumed that the set of parameters was controlled by one locus that could exist in two different allelic forms (A and B).

For each individual, we generated 20,000 cells according a given set of parameter values. In addition, we added a universal noise term $n$ to introduce intra-genotype inter-individual varation. This correspond to variations of the level of errors on the parameter values.

We considered a panel of individuals and their genotype at 200 markers evenly spaced every 5cM.

For simplicity, individuals were considered as haploid segregants derived from a cross between two haploid parents of genotype A and B, respectively.

The dataset is available as an R package called ptldata (https://github.com/fchuffar/ptldata).

The ptldata package contains the following objects:

library(ptldata)
indiv = ptldata::indiv
genodata = ptldata::genodata
cells = ptldata::cells

which are described below.

indiv

indiv displays relevant information on all individuals. It is a dataframe composed of 1490 rows (individuals) and 8 variables (descriptors):

Row names are the names of individuals.

In this tutorial, we only focus on individuals that have been generated with a level of error of 0.2 and, for computational reasons we downsample the data to 80 individuals.

err = 0.2
nb_indiv = 80
idx = which(indiv$err==err)[1:nb_indiv]
cells = cells[idx]
indiv = indiv[idx,]
genodata = genodata[,idx]
head(indiv)

The following figure shows the distribution of the first, second and third moment of the single cells data associated to each individual (grey bars). We also color the histograms according to the genotype at the locus of interest (allele A in red and allele B in blue). This corresponds to the discriminant parental allele origin, displayed in the the 'all' variable of the 'indiv' dataframe.

We observe that the distributions of the third moment differ according to genotypes.

layout(matrix(1:3, 1), respect=TRUE)
for (m in c("m1","m2","m3")){  
  h = hist(indiv[[m]], nclass=20, main=m, xlab="", ylab="", col="grey")
  for (all in unique(indiv$all)) {
    h = hist(indiv[indiv$all==all,][[m]], plot=FALSE, breaks=h$breaks)    
    lines(h, col=adjustcolor(all*2, alpha=0.3))
  }
}

genodata

genodata specifies the genotype of the simulated individuals at 201 markers. It is a data frame with 201 rows (the markers) and 1490 columns (individuals). Marker 101 corresponds to the genotypes matching grous A and B and therefore represents a locus controlling the third moment of the distributions. Genotypes at marker 101 is fully correlated with the value of the 'all' variable of the 'indiv' data frame. From this marker 101, genotypes at other markers were generated by propagating (upstream and downstream) the parental genotypes to consecutive markers with a recombination factor of 5 (corresponding to 5cM). Names of columns and rows correspond to the names of individuals and markers, respectively.

genodata[96:106, 1:7]

The following plot represents the distribution of the recombination factor (teta) between each pair of consecutive markers. As expected, it is centered on 0.05 (i.e. 5cM).

tetas = compute_teta(genodata)
plot(density(tetas), 
  main=paste("Recombination Fraction Distribution"), 
  xlab=paste("mean(teta)=", signif(mean(tetas),3), sep=""))
abline(v=mean(tetas), lty=2)

cells

The cells object emdebs simulated single cell trait values obtained for each individual. It is a list of 1490 items (one per individual), each item being a vector of 20000 integers which correspond to the quantitative trait values of 20000 cells from that individual.

The cells list is named with the corresponding individual names.

head(names(cells))

This plot shows the distribution of single cell values for each individual colored according the discriminant parental allele origin.

plot(0,0, col=0, xlim=c(-1.5,27), ylim=c(0,0.3), 
  main="distribution of phenotypes", xlab="single cell value", ylab="density")
foo = sapply(rownames(indiv), function(i) {
  lines(density(cells[[i]], bw=1), col=indiv[i,"all"]*2)
})

Method

The purpose of this tutorial is to present a general method allowing to map single-cell Probabilistic Trait Loci (scPTL). Functions presented here are freely available in the R package ptlmapper.

library(ptlmapper)

This tutorial is composed of the following steps:

Analysis of the single cell signal (phenotype)

For each individual, we construct one histogram that is based on the list of single-cell values. We obtain a list of histograms (one per individual), that all share the same breaks values. Binning can be modified using the bin_width argument.

pheno_hists = build_pheno_hists(cells, bin_width=1)

Computation of the Kantorovitch distance matrix

From the list of histograms, we build the Kantorovitch distance matrix. Since Kantorovitch distance between two individuals is computed using their histograms, it is important that the breaks values remains identical between individuals. This matrix is a symmetric distance matrix where columns and rows are named after the corresponding individuals.

kd_matrix = build_kd_matrix(pheno_hists)
kd_matrix[1:6,1:6]

Computation of the multivariate moments matrix

From the list of single cell values stored in the histograms we build the multivariate moments matrix. This matrix is composed of 1 line per individual (named after the individual) and nb_moment columns named from moment_1 to moment_nb. nb_moment is highest-order moment you want to compute.

mm_matrix = build_mmoments_matrix(pheno_hists, nb_moment=4)
head(mm_matrix)

Preparation of the genotype

The methods needs an annotated genome to work properly. Here, we add four colums to the genodata dataframe:

In genodata, "2" stands for undetermined and is replaced by NA (missing data).

bckg = names(cells)
genodata$chromosome = 1
genodata$position = 1:nrow(genodata)
genodata$prob_name = rownames(genodata)
genodata$rec_fractions = 0.05
genodata_ptl = preprocess_genodata(genodata, bckg)

Perform scPTL mapping using Kantorovich and multivariate moments method

The plt_mapping function scans the genome to detect scPTLs using one of the two methods: kanto or mmoments. Please refer to publication Chuffart et al. 2016 for more details. The common algorithm is the following:

Phenotypes of individuals are described by a vector (PCA on mm_matrix or MDS on kd_matrix). The algorithm determines with the Kaiser criterion which dimensions of this orthogonal transformation are relevant for canonical analysis (eig_to_scan) If nb_dim=0, a z-score criterion is used to determine how many dimensions the canonical analysis should use to give the best result. The main purpose is to fixe nb_dim in 2:max(eig_to_scan). If nb_dim=NULL, nb_dim is automatically set to max(eig_to_scan). If nb_dim>0 the canonical analysis is performed using the provided nb_dim. In this two last cases, keep in mind that a high value of nb_dim introduces a high degree of freedom in your canonical analysis. The number of dimensions nb_dim needs to be substantially smaller than the size of your population.

kanto_analysis = ptl_scan(kd_matrix, genodata_ptl, method="kanto")
mmoments_analysis = ptl_scan(mm_matrix, genodata_ptl, method="mmoments")

Perform the qtl mapping using the classic method

The ptlmapper package embed features of the widely-used r/qtl package [Broman et al. (2003)]. It allows to perform classic QTL mapping on univariate criteria (ex: mean, variance) using ptlmapper data structures.

rqtl_analysis = rqtl_launch(genodata_ptl, pheno_hists, kanto_analysis, mmoments_analysis)

Workflow

All these steps are embeded in the ptl_mapping function. This function is a workflow that aggregates parameter values, inputs, and outputs of the call, to the R package ptlmapper function. The resulting data structure could be cache (i.e. stored in memory) on the file system (using ptl_mapping_filename argument of the function). It is particularly useful to massivelly save multiple calls to plt_scan and rqtl_launch function, for example in a complex distributed design adressing many single phenotypes on a given population.

ptl_mapping_result = ptl_mapping(genodata, cells, bckg, nb_perm=20, bin_width=1)

Result

These results concern univariate phenotype analysis. This figure displays genome scans for mean, var and noise QTL mapping.

QTL

layout(matrix(1:3, 1, byrow=TRUE), respect=TRUE)
plot_rqtl(ptl_mapping_result, which_pheno=1)
plot_rqtl(ptl_mapping_result, which_pheno=2)
plot_rqtl(ptl_mapping_result, which_pheno=3)

scPTL

These results concern multivariate approach with the two methods "kanto" and "mmoments".

Kantorovich method

The function get_best_markers_rptl returns the details of marker that obtains the best scPTL score according to the method used.

best_marker_kanto = get_best_markers_rptl(ptl_mapping_result, method="kanto")
print(best_marker_kanto)

The first figure displays individuals (colored by their genotype at the marker having highest linkage score) in the two first dimensions of the phenotypic space (this space was produced by multidimensional scaling). The second figure shows scPTL scan of the genome obtained by using the "kanto" method. The third figure displays the phenotypic discrimination produced by the marker that obtains the best PTL score by using the "kanto" method.

layout(matrix(1:3, 1, byrow=TRUE), respect=TRUE)
best_marker_name = rownames(best_marker_kanto)[1]
col = marker2col(ptl_mapping_result, best_marker_name)
plot_orth_trans(ptl_mapping_result, col=col, method="kanto")
plot_wilks(ptl_mapping_result, method="kanto", ylim=c(0,10))
plot_can(ptl_mapping_result, best_marker_name, col=col, method="kanto")
multivariate moments method

To extract the results obtained by this method, we type:

best_marker_mmoments = get_best_markers_rptl(ptl_mapping_result, method="mmoments")
print(best_marker_mmoments)

The first figure displays individuals (colored by their genotype at the marker having highest linkage score) along the two first components of the PCA. The second figure shows scPTL mapping of the genome obtained by using the "mmoments" method. The third figure displays the phenotypic discrimination produced by the marker that obtains the best PTL score by using the "mmoments" method.

layout(matrix(1:3, 1, byrow=TRUE), respect=TRUE)
best_marker_name = rownames(best_marker_mmoments)[1]
col = marker2col(ptl_mapping_result, best_marker_name)
plot_orth_trans(ptl_mapping_result, col=col, method="mmoments")
plot_wilks(ptl_mapping_result, method="mmoments", ylim=c(0,10))
plot_can(ptl_mapping_result, best_marker_name, col=col, method="mmoments")

Bibliography

Acknowledgment

We are grateful to the Pole Scientifique de Modelisation Numerique (Lyon, France) and the Grid'5000 experimental testbed (www.grid5000.fr) for computer resource, and the Centre Blaise Pascal of ENS de Lyon for housing the code on a forge. This work was supported by the European Research Council under the European Union's Seventh Framework Programme FP7/2007-2013 [grant SiGHT number 281359].



fchuffar/ptlmapper documentation built on March 27, 2024, 3:28 p.m.