Fantasio is composed of several functions. Its goals are:
Fantasio implements the creation of several random sparse submaps on genome-wide data (to remove linkage disequilibrium). It also provides graphical outputs to facilitate interpretations of homozygosity mapping results and plots.
In this vignette, we illustrate how to use the package, using the data set HGDP-CEPH, a ship which contains 1043 individuals and 660918 markers. In order to access the data, you will need to download the HGDP.CEPH package (see below how) and load it.
Not all options of the functions are described here, but rather their basic usage. The reader is advised to look at the manual page of the function for details.
Fantasio implements a maximum likelihood method that uses a hidden Markov chain to model the dependencies along the genome between the (observed) marker genotypes of an individual, and its (unobserved) homozygous by descent (HBD) status. The emission probabilities of this hidden Markov model (HMM) depend on the allele frequencies. The transition probabilities depend on the genetic distance between two adjacent markers. This model allows estimating the inbreeding coefficient \(f\) of an individual, and a parameter \(a\), where \(af\) is the instantaneous rate of change per unit map length (here cM) from no HBD to HBD. Both HBD and non-HBD segment lengths are assumed to be distributed exponentially with mean lengths $\frac{1}{a(1-f)}$ and $\frac{1}{af}$, respectively.
The method requires the markers to be in minimal linkage disequilibrium (LD). Otherwise biased estimations of \(f\) are produced. A strategy consisting of generating multiple random sparse genome maps (submaps) has been proposed to avoid this bias (Leutenegger et al. 2011). When several submaps are considered, \(f\) is estimated as the median value on all the f estimation obtained on the different maps after removing submaps with \(a > 1\) (having an average HBD segment length of 1 cM is unlikely to be detected with a SNP density of 1 per 0.5 cM). This strategy has the advantage of not requiring any LD computation on the sample and of minimizing loss of information, as compared with a strategy that based on a single map of markers in minimal LD.
Fantasio statistical framework allows fixing HMM parameters to compute the likelihood of a mating type. These likelihoods can be used for:
When multiple submaps are used, the median p-values/probabilities are considered. See Leutenegger et al. 2011 for more details on the calculations.
Homozygosity mapping (Lander and Botstein 1987) consists in focusing on inbred affected individuals and searching for a region of the genome of shared homozygosity. Original homozygosity mapping requires that the genealogy of patients be known so that inbred patients can be identified and their respective \(f\) estimated. Leutenegger et al. (Leutenegger et al. 2006) proposed using the \(f\) estimated on genome-wide genetic data to compute a FLOD score, similar to Morton's LOD score (Morton 1955).
Genin et al. (Genin et al. 2012) adapted the FLOD formula for multiple submaps. $FLOD^{(i)}(m,s)$ is computed for each individual \(i\), each marker \(m\) on each submap \(s\), using the equation (1):
$FLOD^{(i)}(m,s) = log_{10}\frac{P\left(Y_{m,s}^{(i)} | H_{1}\right)}{P\left(Y_{m,s}^{(i)} | H_{0}\right)} = log_{10}\frac{P\left(X_{m,s}^{(i)}=1 | Y_{m,s}^{(i)}\right) + q.P\left(X_{m,s}^{(i)}=0 | Y_{m,s}^{(i)}\right)}{\hat{f}_s^{(i)} + q.\left(1-\hat{f}_s^{(i)}\right)}$
With the following parameters :
Results are then averaged over the different submaps to obtain a single \(FLOD^{(i)}(m)\) at each marker m.
Genin et al. (Genin et al. 2012) proposed to detect fully penetrant rare recessive variants by performing homozygosity mapping on inbred cases from Genome-Wide Association Study (GWAS) data. Linkage evidence is then evaluated over the entire set I of inbred cases by computing a FLOD score, $HFLOD(m,\alpha))$, at each marker m, in presence of genetic heterogeneity using a parameter $(\alpha)$ (2):
$HFLOD(m,\alpha)=\sum log_{10} \left[\alpha.\frac{P\left(Y_{m,s}^{(i)} | H_{1}\right)}{P\left(Y_{m,s}^{(i)} | H_{0}\right)}+ (1 - \alpha)\right ]= \sum log_{10} \left[\alpha . exp \left(FLOD^{(i)}(m) \times \log(10)\right)+(1-\alpha)\right]$
This heterogeneity score is then maximized over $(\alpha)$ to evaluate the evidence of linkage at marker m where $(\alpha)$ is the estimate of the proportion of cases linked to this locus (3):
$HFLOD(m)=max_{\alpha}(HFLOD(m,\alpha))$
The package Fantasio
depends on gaston
. Hereafter the functions of
this package are used for data manipulation. If you are not familiar
with gaston
, please refer to the vignette of this package for more
information.
First and foremost install the package with the following command:
devtools::install_github("genostats/Fantasio@margot", build_vignettes = TRUE)
After that we can load the package.
require(Fantasio) ## Loading required package: Fantasio ## Loading required package: parallel ## Loading required package: gaston ## Loading required package: Rcpp ## Loading required package: RcppParallel ## ## Attaching package: 'RcppParallel' ## The following object is masked from 'package:Rcpp': ## ## LdFlags ## Gaston set number of threads to 8. Use setThreadOptions() to modify this. ## ## Attaching package: 'gaston' ## The following object is masked from 'package:stats': ## ## sigma ## The following objects are masked from 'package:base': ## ## cbind, rbind
We illustrate the usage of the package with the data provided in the data package HGDP-CEPH. First, we need to install it:
install.packages("HGDP.CEPH", repos="https://genostats.github.io/R/")
We can load the HGDP-CEPH data as follows:
require(HGDP.CEPH) ## Loading required package: HGDP.CEPH filepath <- system.file("extdata", "hgdp_ceph.bed", package="HGDP.CEPH") x <- read.bed.matrix(filepath) ## Reading /home/margot/R/x86_64-pc-linux-gnu-library/3.6/HGDP.CEPH/extdata/hgdp_ceph.rds ## Reading /home/margot/R/x86_64-pc-linux-gnu-library/3.6/HGDP.CEPH/extdata/hgdp_ceph.bed
The object x
is a bed.matrix object (see gaston package for details).
The statistics on SNPs and individual callrates, etc, have not been
computed; run
x <- set.stats(x) ## ped stats and snps stats have been set. ## 'p' has been set. ## 'mu' and 'sigma' have been set.
to add them in the bed matrix. These data have genetic distances
(centiMorgans) set in the @snps
slot, column dist
:
head(x@snps) ## chr id dist pos A1 A2 N0 N1 N2 NAs N0.f N1.f N2.f NAs.f ## 1 1 rs3094315 0.09162711 742429 C T 75 346 620 2 NA NA NA NA ## 2 1 rs12562034 0.09918407 758311 A G 81 275 674 13 NA NA NA NA ## 3 1 rs3934834 0.49630053 995669 T C 58 268 717 0 NA NA NA NA ## 4 1 rs9442372 0.50394850 1008567 A G 149 423 471 0 NA NA NA NA ## 5 1 rs3737728 0.50800870 1011278 T C 45 290 708 0 NA NA NA NA ## 6 1 rs11260588 0.50867275 1011521 A G 3 49 991 0 NA NA NA NA ## callrate maf hz ## 1 0.9980825 0.23823247 0.33237272 ## 2 0.9875360 0.21213592 0.26699029 ## 3 1.0000000 0.18408437 0.25695110 ## 4 1.0000000 0.34563758 0.40556088 ## 5 1.0000000 0.18216683 0.27804410 ## 6 1.0000000 0.02636625 0.04697987
Most datasets come with physical positions (column pos
in @snps
),
but no genetic distance. They can be added with Gaston. First, you need
install the HumanGeneticMap package with
install.packages("HumanGeneticMap", repos="https://genostats.github.io/R/")
.
Then, you would run
x <- set.dist(x, HumanGeneticMap::genetic.map.b36) ## Setting distances for chromosome 1 ## Setting distances for chromosome 2 ## Setting distances for chromosome 3 ## Setting distances for chromosome 4 ## Setting distances for chromosome 5 ## Setting distances for chromosome 6 ## Setting distances for chromosome 7 ## Setting distances for chromosome 8 ## Setting distances for chromosome 9 ## Setting distances for chromosome 10 ## Setting distances for chromosome 11 ## Setting distances for chromosome 12 ## Setting distances for chromosome 13 ## Setting distances for chromosome 14 ## Setting distances for chromosome 15 ## Setting distances for chromosome 16 ## Setting distances for chromosome 17 ## Setting distances for chromosome 18 ## Setting distances for chromosome 19 ## Setting distances for chromosome 20 ## Setting distances for chromosome 21 ## Setting distances for chromosome 22 ## No map for chromosome 23 ## No map for chromosome 24 ## No map for chromosome 25 ## No map for chromosome 26
as the SNP positions in these data are on hg18/b36 reference.
We are going to work on the Bedouin population. To select this population run:
x.be <- select.inds(x, population == "Bedouin")
Two differents methods for submaps creation are implemented in the package:
"By Hotspots": with this method we use recombination hotspots to
segment the genome. The package contains recombination hotspots for
the Human Genome in builds 35 (hg17), 36 (hg18) and 37 (hg19) in
datasets hotspots_hg17
, hotspots_hg18
,hotspots_hg19
. The
default recombination hotspots dataset used is hotspots_hg19
.
Segments should contain at least a user defined number of markers.
Markers are then randomly selected within each segment along the
genome. By doing this process we obtain a submap (a list of marker).
"By Distance" or "by SNPs": with this method we use a fixed step based on genetic or physiscal distance (0.5 cM by default) to pick a marker randomly along the genome. More technically, segments are created whenever there is a gap larger than the step (0.5 cM) between adjacent markers. Each segment is then subdivided in several mini-segments. By default we create 20 mini-segments, each containing at least 50 markers. If this is not possible (not enough markers), we do not create mini-segments. After this process is done, we loop over the mini-segments, pick a random marker and walk through the mini-segments by picking the nearest marker after taking a step (default 0.5 cM) downstream and upstream the mini-segments.
The function Fantasio
calls two different functions for submaps
creation, based on the value of segments
: segmentsListByHotspots
and
segmentsListByDistance
. The first function segmentsListByDistance
is
used to create a list of segments though the genome. The second function
makeAtlasByDistance
or makeAllSubsmapsbyHotspots
is used to create
the submaps.
By default, the submaps are created using the file of recombination hotspots and summarizing the results for each snp that appears in a submap.
F1 <- Fantasio(bedmatrix=x.be, segments="Hotspots", segment.options = list(hotspots = hotspot_hg18), n = 100, verbose = FALSE) ## Warning in setSummary(h, probs = run.proba, recap.by.segments = recap.by.segments, : No individual with pheno = 2. ## Using all inbred individuals with good estimation quality.
The function markerRepresentation
gives for each marker of the
original bed matrix the number of times it has been selected. It can be
used in conjonction with table
to obtain a summary of the number of
markers never seen, seen once, twice, etc:
head(markerRepresentation(F1), 10) ## id chr pos dist Freq ## 1 rs3094315 1 742429 0.09162711 2 ## 2 rs12562034 1 758311 0.09918407 2 ## 3 rs3934834 1 995669 0.49630053 0 ## 4 rs9442372 1 1008567 0.50394850 2 ## 5 rs3737728 1 1011278 0.50800870 6 ## 6 rs11260588 1 1011521 0.50867275 0 ## 7 rs9442398 1 1011558 0.50878962 0 ## 8 rs6687776 1 1020428 0.54737207 0 ## 9 rs9651273 1 1021403 0.54793723 1 ## 10 rs4970405 1 1038818 0.55291100 2 table(markerRepresentation(F1)$Freq) ## ## 0 1 2 3 4 5 6 7 8 9 10 ## 217897 154511 101646 62532 39023 24346 16145 10899 7659 5467 4090 ## 11 12 13 14 15 16 17 18 19 20 21 ## 3081 2267 1771 1474 1107 947 774 632 549 459 369 ## 22 23 24 25 26 27 28 29 30 31 32 ## 336 286 274 207 210 166 147 110 132 115 109 ## 33 34 35 36 37 38 39 40 41 42 43 ## 103 67 68 68 48 31 32 22 33 27 20 ## 44 45 46 47 48 49 50 51 52 53 54 ## 30 35 28 30 40 42 38 42 38 27 27 ## 55 56 57 58 59 60 61 62 63 64 68 ## 30 25 14 9 3 3 4 3 3 2 1 ## 100 ## 188
The slot F1@submap_summary
contains a summmary of results accross
submaps for all individuals:
head(F1@submap_summary, 10) ## famid id pheno submaps quality f_min f_max f_mean ## 1 HGDP00607 HGDP00607 1 100 100 0.01968740 0.03266918 0.0249375555 ## 2 HGDP00608 HGDP00608 1 100 100 0.03329754 0.04466905 0.0378773806 ## 3 HGDP00609 HGDP00609 1 100 100 0.03555332 0.04570728 0.0395621560 ## 4 HGDP00610 HGDP00610 1 100 100 0.04405081 0.05855172 0.0496007528 ## 5 HGDP00611 HGDP00611 1 96 96 0.00000000 0.01027520 0.0003100903 ## 6 HGDP00612 HGDP00612 1 100 100 0.04563621 0.06452940 0.0541143211 ## 7 HGDP00613 HGDP00613 1 100 100 0.00000000 0.00000000 0.0000000000 ## 8 HGDP00614 HGDP00614 1 100 100 0.02374184 0.03547134 0.0289076895 ## 9 HGDP00615 HGDP00615 1 100 100 0.08069546 0.09608753 0.0888941435 ## 10 HGDP00616 HGDP00616 1 100 100 0.06384241 0.08046822 0.0709961142 ## f_median a_median pLRT_median inbred pLRT_inf_0.05 ## 1 0.02456151 0.13107792 1.534069e-37 TRUE 100 ## 2 0.03771483 0.06290722 4.902576e-69 TRUE 100 ## 3 0.03913424 0.10664411 1.144885e-72 TRUE 100 ## 4 0.04932863 0.15372793 1.366378e-79 TRUE 100 ## 5 0.00000000 0.01000000 1.000000e+00 FALSE 1 ## 6 0.05363621 0.28137287 5.006136e-56 TRUE 100 ## 7 0.00000000 0.01000000 1.000000e+00 FALSE 0 ## 8 0.02847253 0.13542484 6.424485e-41 TRUE 100 ## 9 0.08883544 0.09133073 5.673614e-168 TRUE 100 ## 10 0.07040706 0.15074459 3.685643e-104 TRUE 100
If you are interested in the values of \(f\) and \(a\) accross the
submaps, you can find them in F1@estimation_summary
.
The function also computed HBD probabilities that can be found in
F1@HBD_recap
, which are used to determine the position of HBD
segments, which are recapitulated in F1@HBDsegments
. It is required
that at least n.consecutive.marker
markers are HBD before calling a
HBD segment. Default value isn.consecutive.marker=5
.
Moreover, FLOD and HFLOD linkage statistics are computed. The FLOD of
each individual is in F1@FLOD_recap
; they are used to compute the
whole sample statistics HFLOD in F1@HFLOD
.
By default, the package only computes HBD, FLOD statistics and HFLOD
statistics for affected individuals (phenotype = 2). In case there are
no affected individuals, all individuals are taken into account (a
warning is issued). The list.id
argument allows to specify a list of
individuals.
For the "Hotspots" method, the results can also be summarized globally
for each segment using option recap.by.segments=TRUE
. In that case,
n.consecutive.marker
should be set to 1.
F2 <- Fantasio(bedmatrix = x.be, segments = "Hotspots", segment.options = list(hotspots = hotspot_hg18), n = 100, recap.by.segments = TRUE, n.consecutive.marker = 1, verbose = FALSE) ## Warning in setSummary(h, probs = run.proba, recap.by.segments = recap.by.segments, : No individual with pheno = 2. ## Using all inbred individuals with good estimation quality.
F3 <- Fantasio(bedmatrix=x.be, segments="Distance", n = 100, verbose = FALSE) ## Warning in setSummary(h, probs = run.proba, recap.by.segments = recap.by.segments, : No individual with pheno = 2. ## Using all inbred individuals with good estimation quality.
In addition to all the above mentionned summaries, the function
markerSummary
returns the number of SNPs in each submap. Note that
when `segments = "Hotspots", it's always the number of hotspot
segments, but here it varies from submap to submap.
head(markerSummary(F3)) ## number_of_markers_used ## Submap 1 6229 ## Submap 2 6239 ## Submap 3 6236 ## Submap 4 6238 ## Submap 5 6221 ## Submap 6 6233
To run logistic regression use run.logistic = TRUE
in Fantasio
function. (default = FALSE)
Specify your explanatory variable in expl_var
. It can be HBD_prob
or FLOD
.
You can import a dataframe with covariates such as (age, sex, BMI ...) incovariates
argument to run adjusted glm.
segment.option
argumentIn order to use the segment.option
argument you need to pass a list of
arguments, each variable name in the list must be an argument name in
the function. The function that will be called is either
createsSegmentsListByDistance
if segments
argument is equal to
"Distance" or segmentsListByHotspots
if segments
argument is equal
to "Hotspots" and the arguments list will be passed to it. So refer to
these functions for possible arguments.
l <- list(hotspots = hotspot_hg18, minMarkers = 50) # default is 0 F4 <- Fantasio(bedmatrix = x.be, segments = "Hotspots", n = 100, segment.options = l, verbose = FALSE) ## Warning in setSummary(h, probs = run.proba, recap.by.segments = recap.by.segments, : No individual with pheno = 2. ## Using all inbred individuals with good estimation quality.
In the case of "Hotspots", by default, we do not require to have a
minimum number of markers in each segment (minMarkers = 0
). With the
above command line, we impose to have at least 50 markers in a segment.
Otherwise it is merged with the previous segment.
We illustrate the plotting function on the F1
object obtained with
segments="Hotspots"
and F2
obtained with segments="Hotspots"
and
recap.by.segments=TRUE
.
This plot shows the values of the HFLOD linkage statistics along the
genome (contained in F1@HFLOD
).
HFLODManhattanPlot(F1)
The points are the HFLOD values. In the case of F1
, they are computed
at every marker included in at least one submaps. The red line is the
value of their moving average (see below).
To plot a single chromosome, use the following function:
HFLODPlotChr(F1, chr=15)
The red line plotted is the moving average of the HFLOD, calculated on moving windows of 50 markers. This allows checking the consistency of HFLOD calculations (i.e. checking the fact that a high HFLOD score is not due to one submap only).
For method "Hotspots by segments", the use of the moving average
does not make much sense as results are already averaged over the
SNPs of a segment between hospots regions. We recommend using
MA=FALSE
as this:
HFLODPlotChr(F2, chr=15, MA=FALSE)
The HBD plots allow to visualize the positions of the HBD segments on
the genome, contained in F1@HBDsegments
. You can plot multiple
individuals for one chromosome (by default, all individuals with \(f\)
significantly positive are plotted), as illustrated below with F1
and
F2
:
HBDPlotChr(F1, chr=15)
HBDPlotChr(F2, chr=15)
It is also possible to look at all HBD segments of a given individual
(again we show the results for F1
and F2
):
HBDPlotId(F1, id = "HGDP00649", famid = "HGDP00649")
HBDPlotId(F2, id = "HGDP00649", famid = "HGDP00649")
Current data doesn't allow to use this funcitonnality of Fantasio but
- when it's allowed you can plot QQ-plots and Manhattan-plots using the function glmHBDPlot
. See documentation about this function for more information.
We will now create segments, which will be use to create the submaps later, further explication below, for now use this command :
s1 <- segmentsListByHotspots(x.be, hotspots = hotspot_hg18) ## Using hotspots from hotspot_hg18 ## Gathering all hotspots for the genome : ...................... ## Gathering all the genome's markers : ...................... ## Finding which markers are between two hotspots : ......................
This function creates a list of chromosomes, in each, you have a list of several segments created thanks to the hotspots file given in argument (files are given with the package), in each segments you have SNPs index.
You can watch a summary of what was done with :
segmentsListSummary(s1) ## chromosome number_of_segments number_of_markers ## 1 1 1143 47146 ## 2 2 1075 51611 ## 3 3 939 42555 ## 4 4 882 38215 ## 5 5 870 39191 ## 6 6 834 41404 ## 7 7 726 34011 ## 8 8 706 35659 ## 9 9 699 29513 ## 10 10 782 32753 ## 11 11 663 30521 ## 12 12 739 30331 ## 13 13 574 23931 ## 14 14 481 20314 ## 15 15 509 18429 ## 16 16 549 18487 ## 17 17 475 15527 ## 18 18 552 18911 ## 19 19 323 10004 ## 20 20 465 15759 ## 21 21 262 9025 ## 22 22 271 8956
This function creates a dataframe with three colums :
We will now head toward the creation of submaps using the following commands :
F1 <- makeAtlasByHotspots(x.be, 5, s1) F1 <- festim(F1) F1 <- setSummary(F1, list.id = "all")
For the sake of clarity we have only created 5 submaps, but generally we do 100.
This function will creates 5 submaps, all the parameters can be modified (use args(makeAtlasByHotspots) for more informations).
The variable submaps becomes an submapsList object, you can watch the different elements of it with :
str(F1) #careful it can become huge depending on your data sizes
This object contains all the results of the different computations executed during the process of creating n submaps. Here is a complete description of each structure in this object :
str(F1@segments_list) #careful it can become huge depending on your data sizes
str(F1@submaps_list) #careful it can become huge depending on your data sizes
snsp.matrix
or a
HostspotsMatrix
(here we use the hotspots method). Each submaps
contains 15 slots :str(F1@likelihood_summary)
str(F1@estimation_summary)
head(F1@submap_summary) ## famid id pheno submaps quality f_min f_max f_mean ## 1 HGDP00607 HGDP00607 1 5 100 0.02307054 0.028227601 0.024997474 ## 2 HGDP00608 HGDP00608 1 5 100 0.03481927 0.040759254 0.038219872 ## 3 HGDP00609 HGDP00609 1 5 100 0.03694378 0.040824604 0.038914434 ## 4 HGDP00610 HGDP00610 1 5 100 0.04457247 0.053895480 0.050580150 ## 5 HGDP00611 HGDP00611 1 5 100 0.00000000 0.005577867 0.001115573 ## 6 HGDP00612 HGDP00612 1 5 100 0.05372673 0.061493384 0.056387572 ## f_median a_median pLRT_median inbred pLRT_inf_0.05 ## 1 0.02363142 0.12652141 7.107875e-39 TRUE 5 ## 2 0.03775612 0.06013269 6.020340e-64 TRUE 5 ## 3 0.03849243 0.10540997 3.697390e-74 TRUE 5 ## 4 0.05170420 0.15971819 1.573828e-79 TRUE 5 ## 5 0.00000000 0.01000000 1.000000e+00 FALSE 1 ## 6 0.05515614 0.29336335 1.012755e-58 TRUE 5
bySegments : a boolean indicating whether the creation of summary statistics for HBD and FLOD has to be made by segments or not. By default, it is FALSE. The description below of HBD_recap and FLOD_recap depends on this choice.
HBD_recap : a dataframe with individuals in row and marker positions in column. It contains the mean of the HBD probability at a marker over the submaps for each individual. If the marker only appears in one submap (no mean needed), then it is just the HBD proba for that marker in that submap.
F1@HBD_recap[1:10, 1:10] # an individual * marker matrix ## rs11260588 rs6603793 rs7525092 rs4648808 ## HGDP00607:HGDP00607 0.004177668 2.496864e-06 0.013449101 1.303554e-03 ## HGDP00608:HGDP00608 0.004681580 7.370413e-06 0.002462725 2.144115e-03 ## HGDP00609:HGDP00609 0.006472987 6.699717e-03 0.004678171 1.981378e-03 ## HGDP00610:HGDP00610 0.014112102 9.672567e-06 0.014375800 4.413431e-06 ## HGDP00611:HGDP00611 0.000000000 1.749224e-03 0.000000000 0.000000e+00 ## HGDP00612:HGDP00612 0.588190216 6.450754e-01 0.742525650 8.549851e-01 ## HGDP00613:HGDP00613 0.000000000 0.000000e+00 0.000000000 0.000000e+00 ## HGDP00614:HGDP00614 0.768097811 8.940857e-01 0.943150847 8.654859e-01 ## HGDP00615:HGDP00615 0.012682905 7.809544e-06 0.298069056 2.112174e-05 ## HGDP00616:HGDP00616 0.021159744 1.379115e-04 0.017317536 3.134844e-02 ## rs10910047 rs903919 rs10910050 rs12082516 ## HGDP00607:HGDP00607 1.589246e-06 1.125124e-02 6.753533e-08 2.179563e-05 ## HGDP00608:HGDP00608 5.612408e-07 1.872173e-06 2.873811e-04 1.708830e-07 ## HGDP00609:HGDP00609 6.353068e-07 3.549626e-06 1.775769e-03 2.255507e-07 ## HGDP00610:HGDP00610 3.291959e-06 6.216004e-03 3.539811e-04 4.981109e-05 ## HGDP00611:HGDP00611 0.000000e+00 0.000000e+00 2.829025e-06 0.000000e+00 ## HGDP00612:HGDP00612 7.737806e-01 8.713890e-01 7.719123e-01 7.947139e-01 ## HGDP00613:HGDP00613 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ## HGDP00614:HGDP00614 9.491040e-01 9.723467e-01 9.810801e-01 9.606419e-01 ## HGDP00615:HGDP00615 2.029699e-03 2.848096e-01 1.783233e-05 9.361770e-06 ## HGDP00616:HGDP00616 1.014055e-04 7.419373e-03 1.869893e-02 2.371933e-03 ## rs2645065 rs2840530 ## HGDP00607:HGDP00607 1.361605e-05 6.883217e-06 ## HGDP00608:HGDP00608 1.557253e-03 1.320757e-03 ## HGDP00609:HGDP00609 2.420824e-06 5.038964e-04 ## HGDP00610:HGDP00610 2.596836e-03 1.064889e-04 ## HGDP00611:HGDP00611 0.000000e+00 0.000000e+00 ## HGDP00612:HGDP00612 8.800096e-01 8.715920e-01 ## HGDP00613:HGDP00613 0.000000e+00 0.000000e+00 ## HGDP00614:HGDP00614 9.129762e-01 8.904461e-01 ## HGDP00615:HGDP00615 2.550489e-05 5.212088e-04 ## HGDP00616:HGDP00616 7.551350e-03 1.909570e-02
F1@FLOD_recap[1:10, 1:10] # an individual * marker matrix ## rs11260588 rs6603793 rs7525092 rs4648808 rs10910047 ## HGDP00607:HGDP00607 -0.7356160 -2.3645709 -0.2330256 -1.2839118 -2.445319 ## HGDP00608:HGDP00608 -0.8985638 -2.5803655 -1.1633156 -1.1920183 -2.606459 ## HGDP00609:HGDP00609 -0.7942212 -0.7525045 -0.9233675 -1.2503623 -2.583709 ## HGDP00610:HGDP00610 -0.5796989 -2.6848496 -0.5369504 -2.6312421 -2.700256 ## HGDP00611:HGDP00611 0.0000000 -0.4871941 0.0000000 0.0000000 0.000000 ## HGDP00612:HGDP00612 1.0385936 1.0672972 1.1396964 1.1424752 1.125804 ## HGDP00613:HGDP00613 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 ## HGDP00614:HGDP00614 1.4390193 1.4926120 1.5179023 1.4920049 1.479503 ## HGDP00615:HGDP00615 -0.8429585 -2.9191408 0.5165932 -2.8809159 -1.623471 ## HGDP00616:HGDP00616 -0.5407423 -2.4970489 -0.6376400 -0.3489034 -2.580316 ## rs903919 rs10910050 rs12082516 rs2645065 rs2840530 ## HGDP00607:HGDP00607 -0.3098928 -2.3749872 -2.366543 -2.3113462 -2.4021939 ## HGDP00608:HGDP00608 -2.5639212 -2.0231409 -2.608148 -1.3587449 -1.3905425 ## HGDP00609:HGDP00609 -2.5874397 -1.3118144 -2.585480 -2.6015539 -1.7877458 ## HGDP00610:HGDP00610 -0.8971502 -2.0679398 -2.538792 -1.3015013 -2.3351223 ## HGDP00611:HGDP00611 0.0000000 -1.7420280 0.000000 0.0000000 0.0000000 ## HGDP00612:HGDP00612 1.2091884 1.1452436 1.137395 1.2135387 1.1508289 ## HGDP00613:HGDP00613 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 ## HGDP00614:HGDP00614 1.5311409 1.5329331 1.484750 1.5140532 1.5043511 ## HGDP00615:HGDP00615 0.4968378 -2.8805369 -2.912886 -2.8508932 -2.1709293 ## HGDP00616:HGDP00616 -1.0024447 -0.5993492 -1.491371 -0.9845619 -0.5632986
str(F1@HBDsegments[[1]]) ## 'data.frame': 20 obs. of 11 variables: ## $ id : Factor w/ 1 level "HGDP00607": 1 1 1 1 1 1 1 1 1 1 ... ## $ famid : Factor w/ 1 level "HGDP00607": 1 1 1 1 1 1 1 1 1 1 ... ## $ pheno : int 1 1 1 1 1 1 1 1 1 1 ... ## $ start : num 18788 20590 20600 20626 20649 ... ## $ end : num 19127 20595 20605 20647 20654 ... ## $ size : num 340 6 6 22 6 7 7 5 7 14 ... ## $ chromosome: int 5 5 5 5 5 5 5 6 6 6 ... ## $ start_pos : int 34611504 146164325 146571759 147249909 148176930 148622110 149027979 150577395 150743620 151206457 ... ## $ end_pos : int 58142052 146505732 146612787 148149539 148352615 148671072 149227931 150695815 150793567 151383747 ... ## $ start_dist: num 53.2 148.8 149.3 150.6 151.8 ... ## $ end_dist : num 69.4 149.1 149.6 151.6 152.2 ...
str(F1@HFLOD) ## 'data.frame': 63644 obs. of 6 variables: ## $ chr : int 1 1 1 1 1 1 1 1 1 1 ... ## $ snps : Factor w/ 63644 levels "rs1000000","rs10000010",..: 7589 46931 53318 39873 5306 59074 5307 10842 32111 33711 ... ## $ pos_cM: num 0.509 1.06 1.153 1.791 1.849 ... ## $ pos_Bp: int 1011521 1495118 1799950 2030796 2201709 2212443 2214736 2241020 2251082 2275590 ... ## $ HFLOD : num 0.0442 0.0685 0.0801 0.0725 0.1149 ... ## $ ALPHA : num 0.0205 0.0235 0.0245 0.0239 0.0318 ...
str(F1@bedmatrix)
We implemented a second inner method for the "Hotspots" method. The only
paramater that changes is recap.by.segments
, it is put to TRUE.
In the default "Hotspots" method the HBD probabilities and FLOD scores are computed for each marker (see above).
With the "Hotspots by segment" method, HBD probabilities and FLOD scores correspond to the mean of HBD probabilities and FLOD score of all the markers of a segment that have been selected in one of the sumaps. There is hence only one value for a segment delimited by hotspots. So we also recommend to set n.consecutive.marker to 1. Since results over multiple markers have already been averaged.
We use the same segment list that is used before (s1).
As said before the only argument that changes is "recap.by.segments", it is put to TRUE. And we recommend to adjust "n.consecutive.marker".
F2 <- makeAtlasByHotspots(x.be, 100, s1) F2 <- festim(F2) F2 <- setSummary(F2, recap.by.segments = TRUE, n.consecutive.marker = 1, list.id = "all")
We will now create segments, which will be used to create the submaps later :
s3 <- segmentsListByDistance(x.be) ## Finding segments for the genome : ....................... ## Finding which markers are between two segments: ....................... ## Finding mini segments .......................
This function creates segments based on the markers available in the data and the gaps between them. The value of the gap is imposed by the minimun step required between markers. If 2 adjacent markers have an inter-distance larger than the step, then there is a gap between them. The first marker of the pair is the end of a segment and the second marker is the start of a new segment. The function creates an object which will contain three slots :
You can watch a summary of what was done with :
segmentsListSummary(s3) ## chromosome number_of_segments number_of_markers ## 1 1 7 49639 ## 2 2 3 53765 ## 3 3 2 44564 ## 4 4 3 39942 ## 5 5 1 40976 ## 6 6 1 43239 ## 7 7 3 35507 ## 8 8 2 37282 ## 9 9 5 31192 ## 10 10 2 34493 ## 11 11 2 32005 ## 12 12 1 31873 ## 13 13 1 25191 ## 14 14 2 21450 ## 15 15 5 19594 ## 16 16 2 19727 ## 17 17 1 16629 ## 18 18 2 20165 ## 19 19 1 10739 ## 20 20 1 16911 ## 21 21 2 9645 ## 22 22 5 9730 ## 23 25 1 15
This function creates a dataframe with three colums :
We will now head toward the creation of submaps using the following commands :
F3 <- makeAtlasByDistance(x.be, 5, s3) F3 <- festim(F3) F3 <- setSummary(F3, list.id = "all")
The variable submaps becomes an submapsList object, you can watch the different elements of it with :
str(F3) # careful it can become huge depending on your data sizes
We implemented a paralellism method to make the creation of the submaps more efficient. We paralellized the creation of the submaps, that is to say, the selection of markers. Make sure to have an environment that can support the usage of multiple CPU.
Use the n.cores
argument, i.e the number of CPU that will be used to
make the differents submaps in the functions Fantasio
,
makeAtlasByHotspots
, makeAtlasByDistance
:
F1 <- Fantasio(bedmatrix = x.be, segments = "Hotspots", segment.options = list(hotspots = hotspot_hg18), n = 100, verbose = FALSE, n.cores = 10)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.