Introduction

Fantasio is composed of several functions. Its goals are:

Fantasio implements the creation of several random sparse submaps on genome-wide data (to remove linkage disequilibrium). It also provides graphical outputs to facilitate interpretations of homozygosity mapping results and plots.

In this vignette, we illustrate how to use the package, using the data set HGDP-CEPH, a ship which contains 1043 individuals and 660918 markers. In order to access the data, you will need to download the HGDP.CEPH package (see below how) and load it.

Not all options of the functions are described here, but rather their basic usage. The reader is advised to look at the manual page of the function for details.

Principal concepts

Fantasio implements a maximum likelihood method that uses a hidden Markov chain to model the dependencies along the genome between the (observed) marker genotypes of an individual, and its (unobserved) homozygous by descent (HBD) status. The emission probabilities of this hidden Markov model (HMM) depend on the allele frequencies. The transition probabilities depend on the genetic distance between two adjacent markers. This model allows estimating the inbreeding coefficient \(f\) of an individual, and a parameter \(a\), where \(af\) is the instantaneous rate of change per unit map length (here cM) from no HBD to HBD. Both HBD and non-HBD segment lengths are assumed to be distributed exponentially with mean lengths $\frac{1}{a(1-f)}$ and $\frac{1}{af}$, respectively.

The method requires the markers to be in minimal linkage disequilibrium (LD). Otherwise biased estimations of \(f\) are produced. A strategy consisting of generating multiple random sparse genome maps (submaps) has been proposed to avoid this bias (Leutenegger et al. 2011). When several submaps are considered, \(f\) is estimated as the median value on all the f estimation obtained on the different maps after removing submaps with \(a > 1\) (having an average HBD segment length of 1 cM is unlikely to be detected with a SNP density of 1 per 0.5 cM). This strategy has the advantage of not requiring any LD computation on the sample and of minimizing loss of information, as compared with a strategy that based on a single map of markers in minimal LD.

Fantasio statistical framework allows fixing HMM parameters to compute the likelihood of a mating type. These likelihoods can be used for:

When multiple submaps are used, the median p-values/probabilities are considered. See Leutenegger et al. 2011 for more details on the calculations.

Homozygosity mapping (Lander and Botstein 1987) consists in focusing on inbred affected individuals and searching for a region of the genome of shared homozygosity. Original homozygosity mapping requires that the genealogy of patients be known so that inbred patients can be identified and their respective \(f\) estimated. Leutenegger et al. (Leutenegger et al. 2006) proposed using the \(f\) estimated on genome-wide genetic data to compute a FLOD score, similar to Morton's LOD score (Morton 1955).

Genin et al. (Genin et al. 2012) adapted the FLOD formula for multiple submaps. $FLOD^{(i)}(m,s)$ is computed for each individual \(i\), each marker \(m\) on each submap \(s\), using the equation (1):

$FLOD^{(i)}(m,s) = log_{10}\frac{P\left(Y_{m,s}^{(i)} | H_{1}\right)}{P\left(Y_{m,s}^{(i)} | H_{0}\right)} = log_{10}\frac{P\left(X_{m,s}^{(i)}=1 | Y_{m,s}^{(i)}\right) + q.P\left(X_{m,s}^{(i)}=0 | Y_{m,s}^{(i)}\right)}{\hat{f}_s^{(i)} + q.\left(1-\hat{f}_s^{(i)}\right)}$

With the following parameters :

Results are then averaged over the different submaps to obtain a single \(FLOD^{(i)}(m)\) at each marker m.

Genin et al. (Genin et al. 2012) proposed to detect fully penetrant rare recessive variants by performing homozygosity mapping on inbred cases from Genome-Wide Association Study (GWAS) data. Linkage evidence is then evaluated over the entire set I of inbred cases by computing a FLOD score, $HFLOD(m,\alpha))$, at each marker m, in presence of genetic heterogeneity using a parameter $(\alpha)$ (2):

$HFLOD(m,\alpha)=\sum log_{10} \left[\alpha.\frac{P\left(Y_{m,s}^{(i)} | H_{1}\right)}{P\left(Y_{m,s}^{(i)} | H_{0}\right)}+ (1 - \alpha)\right ]= \sum log_{10} \left[\alpha . exp \left(FLOD^{(i)}(m) \times \log(10)\right)+(1-\alpha)\right]$

This heterogeneity score is then maximized over $(\alpha)$ to evaluate the evidence of linkage at marker m where $(\alpha)$ is the estimate of the proportion of cases linked to this locus (3):

$HFLOD(m)=max_{\alpha}(HFLOD(m,\alpha))$

Getting started

The package Fantasio depends on gaston. Hereafter the functions of this package are used for data manipulation. If you are not familiar with gaston, please refer to the vignette of this package for more information.

Installing Fantasio

First and foremost install the package with the following command:

devtools::install_github("genostats/Fantasio@margot", build_vignettes = TRUE)

After that we can load the package.

require(Fantasio)

## Loading required package: Fantasio
## Loading required package: parallel
## Loading required package: gaston
## Loading required package: Rcpp
## Loading required package: RcppParallel
## 
## Attaching package: 'RcppParallel'
## The following object is masked from 'package:Rcpp':
## 
##     LdFlags
## Gaston set number of threads to 8. Use setThreadOptions() to modify this.
## 
## Attaching package: 'gaston'
## The following object is masked from 'package:stats':
## 
##     sigma
## The following objects are masked from 'package:base':
## 
##     cbind, rbind

Installing example data: HGDP-CEPH

We illustrate the usage of the package with the data provided in the data package HGDP-CEPH. First, we need to install it:

install.packages("HGDP.CEPH", repos="https://genostats.github.io/R/")

We can load the HGDP-CEPH data as follows:

require(HGDP.CEPH)

## Loading required package: HGDP.CEPH

filepath <- system.file("extdata", "hgdp_ceph.bed", package="HGDP.CEPH")
x <- read.bed.matrix(filepath)

## Reading /home/margot/R/x86_64-pc-linux-gnu-library/3.6/HGDP.CEPH/extdata/hgdp_ceph.rds 
## Reading /home/margot/R/x86_64-pc-linux-gnu-library/3.6/HGDP.CEPH/extdata/hgdp_ceph.bed

The object x is a bed.matrix object (see gaston package for details). The statistics on SNPs and individual callrates, etc, have not been computed; run

x <- set.stats(x)

## ped stats and snps stats have been set. 
## 'p' has been set. 
## 'mu' and 'sigma' have been set.

to add them in the bed matrix. These data have genetic distances (centiMorgans) set in the @snps slot, column dist:

head(x@snps)

##   chr         id       dist     pos A1 A2  N0  N1  N2 NAs N0.f N1.f N2.f NAs.f
## 1   1  rs3094315 0.09162711  742429  C  T  75 346 620   2   NA   NA   NA    NA
## 2   1 rs12562034 0.09918407  758311  A  G  81 275 674  13   NA   NA   NA    NA
## 3   1  rs3934834 0.49630053  995669  T  C  58 268 717   0   NA   NA   NA    NA
## 4   1  rs9442372 0.50394850 1008567  A  G 149 423 471   0   NA   NA   NA    NA
## 5   1  rs3737728 0.50800870 1011278  T  C  45 290 708   0   NA   NA   NA    NA
## 6   1 rs11260588 0.50867275 1011521  A  G   3  49 991   0   NA   NA   NA    NA
##    callrate        maf         hz
## 1 0.9980825 0.23823247 0.33237272
## 2 0.9875360 0.21213592 0.26699029
## 3 1.0000000 0.18408437 0.25695110
## 4 1.0000000 0.34563758 0.40556088
## 5 1.0000000 0.18216683 0.27804410
## 6 1.0000000 0.02636625 0.04697987

Most datasets come with physical positions (column pos in @snps), but no genetic distance. They can be added with Gaston. First, you need install the HumanGeneticMap package with install.packages("HumanGeneticMap", repos="https://genostats.github.io/R/").
Then, you would run

x <- set.dist(x, HumanGeneticMap::genetic.map.b36)

## Setting distances for chromosome 1 
## Setting distances for chromosome 2 
## Setting distances for chromosome 3 
## Setting distances for chromosome 4 
## Setting distances for chromosome 5 
## Setting distances for chromosome 6 
## Setting distances for chromosome 7 
## Setting distances for chromosome 8 
## Setting distances for chromosome 9 
## Setting distances for chromosome 10 
## Setting distances for chromosome 11 
## Setting distances for chromosome 12 
## Setting distances for chromosome 13 
## Setting distances for chromosome 14 
## Setting distances for chromosome 15 
## Setting distances for chromosome 16 
## Setting distances for chromosome 17 
## Setting distances for chromosome 18 
## Setting distances for chromosome 19 
## Setting distances for chromosome 20 
## Setting distances for chromosome 21 
## Setting distances for chromosome 22 
## No map for chromosome 23 
## No map for chromosome 24 
## No map for chromosome 25 
## No map for chromosome 26

as the SNP positions in these data are on hg18/b36 reference.

We are going to work on the Bedouin population. To select this population run:

x.be <- select.inds(x, population == "Bedouin")

Running Fantasio

Two differents methods for submaps creation are implemented in the package:

The function Fantasio calls two different functions for submaps creation, based on the value of segments: segmentsListByHotspots and segmentsListByDistance. The first function segmentsListByDistance is used to create a list of segments though the genome. The second function makeAtlasByDistance or makeAllSubsmapsbyHotspots is used to create the submaps.

Hotspots (Default)

By default, the submaps are created using the file of recombination hotspots and summarizing the results for each snp that appears in a submap.

F1 <- Fantasio(bedmatrix=x.be, segments="Hotspots", 
           segment.options = list(hotspots = hotspot_hg18), n = 100, 
           verbose = FALSE)

## Warning in setSummary(h, probs = run.proba, recap.by.segments = recap.by.segments, : No individual with pheno = 2.
## Using all inbred individuals with good estimation quality.

The function markerRepresentation gives for each marker of the original bed matrix the number of times it has been selected. It can be used in conjonction with table to obtain a summary of the number of markers never seen, seen once, twice, etc:

head(markerRepresentation(F1), 10)

##            id chr     pos       dist Freq
## 1   rs3094315   1  742429 0.09162711    2
## 2  rs12562034   1  758311 0.09918407    2
## 3   rs3934834   1  995669 0.49630053    0
## 4   rs9442372   1 1008567 0.50394850    2
## 5   rs3737728   1 1011278 0.50800870    6
## 6  rs11260588   1 1011521 0.50867275    0
## 7   rs9442398   1 1011558 0.50878962    0
## 8   rs6687776   1 1020428 0.54737207    0
## 9   rs9651273   1 1021403 0.54793723    1
## 10  rs4970405   1 1038818 0.55291100    2

table(markerRepresentation(F1)$Freq)

## 
##      0      1      2      3      4      5      6      7      8      9     10 
## 217897 154511 101646  62532  39023  24346  16145  10899   7659   5467   4090 
##     11     12     13     14     15     16     17     18     19     20     21 
##   3081   2267   1771   1474   1107    947    774    632    549    459    369 
##     22     23     24     25     26     27     28     29     30     31     32 
##    336    286    274    207    210    166    147    110    132    115    109 
##     33     34     35     36     37     38     39     40     41     42     43 
##    103     67     68     68     48     31     32     22     33     27     20 
##     44     45     46     47     48     49     50     51     52     53     54 
##     30     35     28     30     40     42     38     42     38     27     27 
##     55     56     57     58     59     60     61     62     63     64     68 
##     30     25     14      9      3      3      4      3      3      2      1 
##    100 
##    188

The slot F1@submap_summary contains a summmary of results accross submaps for all individuals:

head(F1@submap_summary, 10)

##        famid        id pheno submaps quality      f_min      f_max       f_mean
## 1  HGDP00607 HGDP00607     1     100     100 0.01968740 0.03266918 0.0249375555
## 2  HGDP00608 HGDP00608     1     100     100 0.03329754 0.04466905 0.0378773806
## 3  HGDP00609 HGDP00609     1     100     100 0.03555332 0.04570728 0.0395621560
## 4  HGDP00610 HGDP00610     1     100     100 0.04405081 0.05855172 0.0496007528
## 5  HGDP00611 HGDP00611     1      96      96 0.00000000 0.01027520 0.0003100903
## 6  HGDP00612 HGDP00612     1     100     100 0.04563621 0.06452940 0.0541143211
## 7  HGDP00613 HGDP00613     1     100     100 0.00000000 0.00000000 0.0000000000
## 8  HGDP00614 HGDP00614     1     100     100 0.02374184 0.03547134 0.0289076895
## 9  HGDP00615 HGDP00615     1     100     100 0.08069546 0.09608753 0.0888941435
## 10 HGDP00616 HGDP00616     1     100     100 0.06384241 0.08046822 0.0709961142
##      f_median   a_median   pLRT_median inbred pLRT_inf_0.05
## 1  0.02456151 0.13107792  1.534069e-37   TRUE           100
## 2  0.03771483 0.06290722  4.902576e-69   TRUE           100
## 3  0.03913424 0.10664411  1.144885e-72   TRUE           100
## 4  0.04932863 0.15372793  1.366378e-79   TRUE           100
## 5  0.00000000 0.01000000  1.000000e+00  FALSE             1
## 6  0.05363621 0.28137287  5.006136e-56   TRUE           100
## 7  0.00000000 0.01000000  1.000000e+00  FALSE             0
## 8  0.02847253 0.13542484  6.424485e-41   TRUE           100
## 9  0.08883544 0.09133073 5.673614e-168   TRUE           100
## 10 0.07040706 0.15074459 3.685643e-104   TRUE           100

If you are interested in the values of \(f\) and \(a\) accross the submaps, you can find them in F1@estimation_summary.

The function also computed HBD probabilities that can be found in F1@HBD_recap, which are used to determine the position of HBD segments, which are recapitulated in F1@HBDsegments. It is required that at least n.consecutive.marker markers are HBD before calling a HBD segment. Default value isn.consecutive.marker=5.

Moreover, FLOD and HFLOD linkage statistics are computed. The FLOD of each individual is in F1@FLOD_recap; they are used to compute the whole sample statistics HFLOD in F1@HFLOD.

By default, the package only computes HBD, FLOD statistics and HFLOD statistics for affected individuals (phenotype = 2). In case there are no affected individuals, all individuals are taken into account (a warning is issued). The list.id argument allows to specify a list of individuals.

Hotspots by Segments

For the "Hotspots" method, the results can also be summarized globally for each segment using option recap.by.segments=TRUE. In that case, n.consecutive.marker should be set to 1.

F2 <- Fantasio(bedmatrix = x.be, segments = "Hotspots", 
          segment.options = list(hotspots = hotspot_hg18), n = 100, 
          recap.by.segments = TRUE, n.consecutive.marker = 1, verbose = FALSE)

## Warning in setSummary(h, probs = run.proba, recap.by.segments = recap.by.segments, : No individual with pheno = 2.
## Using all inbred individuals with good estimation quality.

Distance

F3 <- Fantasio(bedmatrix=x.be, segments="Distance", n = 100, verbose = FALSE)

## Warning in setSummary(h, probs = run.proba, recap.by.segments = recap.by.segments, : No individual with pheno = 2.
## Using all inbred individuals with good estimation quality.

In addition to all the above mentionned summaries, the function markerSummary returns the number of SNPs in each submap. Note that when `segments = "Hotspots", it's always the number of hotspot segments, but here it varies from submap to submap.

head(markerSummary(F3))

##          number_of_markers_used
## Submap 1                   6229
## Submap 2                   6239
## Submap 3                   6236
## Submap 4                   6238
## Submap 5                   6221
## Submap 6                   6233

How to run logistic regression

To run logistic regression use run.logistic = TRUE in Fantasio function. (default = FALSE) Specify your explanatory variable in expl_var. It can be HBD_prob or FLOD. You can import a dataframe with covariates such as (age, sex, BMI ...) incovariates argument to run adjusted glm.

How to use the segment.option argument

In order to use the segment.option argument you need to pass a list of arguments, each variable name in the list must be an argument name in the function. The function that will be called is either createsSegmentsListByDistance if segments argument is equal to "Distance" or segmentsListByHotspots if segments argument is equal to "Hotspots" and the arguments list will be passed to it. So refer to these functions for possible arguments.

l <- list(hotspots = hotspot_hg18, minMarkers = 50) # default is 0
F4 <- Fantasio(bedmatrix = x.be, segments = "Hotspots", n = 100, segment.options = l, verbose = FALSE)

## Warning in setSummary(h, probs = run.proba, recap.by.segments = recap.by.segments, : No individual with pheno = 2.
## Using all inbred individuals with good estimation quality.

In the case of "Hotspots", by default, we do not require to have a minimum number of markers in each segment (minMarkers = 0). With the above command line, we impose to have at least 50 markers in a segment. Otherwise it is merged with the previous segment.

Plotting results

We illustrate the plotting function on the F1 object obtained with segments="Hotspots" and F2 obtained with segments="Hotspots" and recap.by.segments=TRUE.

HFLOD plots

This plot shows the values of the HFLOD linkage statistics along the genome (contained in F1@HFLOD).

HFLODManhattanPlot(F1)

The points are the HFLOD values. In the case of F1, they are computed at every marker included in at least one submaps. The red line is the value of their moving average (see below).

To plot a single chromosome, use the following function:

HFLODPlotChr(F1, chr=15)

HFLODPlotChr(F2, chr=15, MA=FALSE)

HBD plots

The HBD plots allow to visualize the positions of the HBD segments on the genome, contained in F1@HBDsegments. You can plot multiple individuals for one chromosome (by default, all individuals with \(f\) significantly positive are plotted), as illustrated below with F1 and F2:

HBDPlotChr(F1, chr=15)

HBDPlotChr(F2, chr=15)

It is also possible to look at all HBD segments of a given individual (again we show the results for F1 and F2):

HBDPlotId(F1, id = "HGDP00649", famid = "HGDP00649")

HBDPlotId(F2, id = "HGDP00649", famid = "HGDP00649")

GLM plots

Current data doesn't allow to use this funcitonnality of Fantasio but - when it's allowed you can plot QQ-plots and Manhattan-plots using the function glmHBDPlot. See documentation about this function for more information.

Step by step usage of the package Fantasio

Hotspots

Creation of the segments list

We will now create segments, which will be use to create the submaps later, further explication below, for now use this command :

s1 <- segmentsListByHotspots(x.be, hotspots = hotspot_hg18)

## Using hotspots from  hotspot_hg18 
## Gathering all hotspots for the genome : ......................
## Gathering all the genome's markers : ......................
## Finding which markers are between two hotspots : ......................

This function creates a list of chromosomes, in each, you have a list of several segments created thanks to the hotspots file given in argument (files are given with the package), in each segments you have SNPs index.

You can watch a summary of what was done with :

segmentsListSummary(s1)

##    chromosome number_of_segments number_of_markers
## 1           1               1143             47146
## 2           2               1075             51611
## 3           3                939             42555
## 4           4                882             38215
## 5           5                870             39191
## 6           6                834             41404
## 7           7                726             34011
## 8           8                706             35659
## 9           9                699             29513
## 10         10                782             32753
## 11         11                663             30521
## 12         12                739             30331
## 13         13                574             23931
## 14         14                481             20314
## 15         15                509             18429
## 16         16                549             18487
## 17         17                475             15527
## 18         18                552             18911
## 19         19                323             10004
## 20         20                465             15759
## 21         21                262              9025
## 22         22                271              8956

This function creates a dataframe with three colums :

Creation of the submaps and computation

We will now head toward the creation of submaps using the following commands :

F1 <- makeAtlasByHotspots(x.be, 5, s1)
F1 <- festim(F1)
F1 <- setSummary(F1, list.id = "all")

For the sake of clarity we have only created 5 submaps, but generally we do 100.

This function will creates 5 submaps, all the parameters can be modified (use args(makeAtlasByHotspots) for more informations).

The variable submaps becomes an submapsList object, you can watch the different elements of it with :

str(F1) #careful it can become huge depending on your data sizes

Descrition of the object F1

This object contains all the results of the different computations executed during the process of creating n submaps. Here is a complete description of each structure in this object :

str(F1@segments_list) #careful it can become huge depending on your data sizes
str(F1@submaps_list) #careful it can become huge depending on your data sizes
str(F1@likelihood_summary)
str(F1@estimation_summary)
head(F1@submap_summary)

##       famid        id pheno submaps quality      f_min       f_max      f_mean
## 1 HGDP00607 HGDP00607     1       5     100 0.02307054 0.028227601 0.024997474
## 2 HGDP00608 HGDP00608     1       5     100 0.03481927 0.040759254 0.038219872
## 3 HGDP00609 HGDP00609     1       5     100 0.03694378 0.040824604 0.038914434
## 4 HGDP00610 HGDP00610     1       5     100 0.04457247 0.053895480 0.050580150
## 5 HGDP00611 HGDP00611     1       5     100 0.00000000 0.005577867 0.001115573
## 6 HGDP00612 HGDP00612     1       5     100 0.05372673 0.061493384 0.056387572
##     f_median   a_median  pLRT_median inbred pLRT_inf_0.05
## 1 0.02363142 0.12652141 7.107875e-39   TRUE             5
## 2 0.03775612 0.06013269 6.020340e-64   TRUE             5
## 3 0.03849243 0.10540997 3.697390e-74   TRUE             5
## 4 0.05170420 0.15971819 1.573828e-79   TRUE             5
## 5 0.00000000 0.01000000 1.000000e+00  FALSE             1
## 6 0.05515614 0.29336335 1.012755e-58   TRUE             5
F1@HBD_recap[1:10, 1:10] # an individual * marker matrix

##                      rs11260588    rs6603793   rs7525092    rs4648808
## HGDP00607:HGDP00607 0.004177668 2.496864e-06 0.013449101 1.303554e-03
## HGDP00608:HGDP00608 0.004681580 7.370413e-06 0.002462725 2.144115e-03
## HGDP00609:HGDP00609 0.006472987 6.699717e-03 0.004678171 1.981378e-03
## HGDP00610:HGDP00610 0.014112102 9.672567e-06 0.014375800 4.413431e-06
## HGDP00611:HGDP00611 0.000000000 1.749224e-03 0.000000000 0.000000e+00
## HGDP00612:HGDP00612 0.588190216 6.450754e-01 0.742525650 8.549851e-01
## HGDP00613:HGDP00613 0.000000000 0.000000e+00 0.000000000 0.000000e+00
## HGDP00614:HGDP00614 0.768097811 8.940857e-01 0.943150847 8.654859e-01
## HGDP00615:HGDP00615 0.012682905 7.809544e-06 0.298069056 2.112174e-05
## HGDP00616:HGDP00616 0.021159744 1.379115e-04 0.017317536 3.134844e-02
##                       rs10910047     rs903919   rs10910050   rs12082516
## HGDP00607:HGDP00607 1.589246e-06 1.125124e-02 6.753533e-08 2.179563e-05
## HGDP00608:HGDP00608 5.612408e-07 1.872173e-06 2.873811e-04 1.708830e-07
## HGDP00609:HGDP00609 6.353068e-07 3.549626e-06 1.775769e-03 2.255507e-07
## HGDP00610:HGDP00610 3.291959e-06 6.216004e-03 3.539811e-04 4.981109e-05
## HGDP00611:HGDP00611 0.000000e+00 0.000000e+00 2.829025e-06 0.000000e+00
## HGDP00612:HGDP00612 7.737806e-01 8.713890e-01 7.719123e-01 7.947139e-01
## HGDP00613:HGDP00613 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## HGDP00614:HGDP00614 9.491040e-01 9.723467e-01 9.810801e-01 9.606419e-01
## HGDP00615:HGDP00615 2.029699e-03 2.848096e-01 1.783233e-05 9.361770e-06
## HGDP00616:HGDP00616 1.014055e-04 7.419373e-03 1.869893e-02 2.371933e-03
##                        rs2645065    rs2840530
## HGDP00607:HGDP00607 1.361605e-05 6.883217e-06
## HGDP00608:HGDP00608 1.557253e-03 1.320757e-03
## HGDP00609:HGDP00609 2.420824e-06 5.038964e-04
## HGDP00610:HGDP00610 2.596836e-03 1.064889e-04
## HGDP00611:HGDP00611 0.000000e+00 0.000000e+00
## HGDP00612:HGDP00612 8.800096e-01 8.715920e-01
## HGDP00613:HGDP00613 0.000000e+00 0.000000e+00
## HGDP00614:HGDP00614 9.129762e-01 8.904461e-01
## HGDP00615:HGDP00615 2.550489e-05 5.212088e-04
## HGDP00616:HGDP00616 7.551350e-03 1.909570e-02
F1@FLOD_recap[1:10, 1:10] # an individual * marker matrix

##                     rs11260588  rs6603793  rs7525092  rs4648808 rs10910047
## HGDP00607:HGDP00607 -0.7356160 -2.3645709 -0.2330256 -1.2839118  -2.445319
## HGDP00608:HGDP00608 -0.8985638 -2.5803655 -1.1633156 -1.1920183  -2.606459
## HGDP00609:HGDP00609 -0.7942212 -0.7525045 -0.9233675 -1.2503623  -2.583709
## HGDP00610:HGDP00610 -0.5796989 -2.6848496 -0.5369504 -2.6312421  -2.700256
## HGDP00611:HGDP00611  0.0000000 -0.4871941  0.0000000  0.0000000   0.000000
## HGDP00612:HGDP00612  1.0385936  1.0672972  1.1396964  1.1424752   1.125804
## HGDP00613:HGDP00613  0.0000000  0.0000000  0.0000000  0.0000000   0.000000
## HGDP00614:HGDP00614  1.4390193  1.4926120  1.5179023  1.4920049   1.479503
## HGDP00615:HGDP00615 -0.8429585 -2.9191408  0.5165932 -2.8809159  -1.623471
## HGDP00616:HGDP00616 -0.5407423 -2.4970489 -0.6376400 -0.3489034  -2.580316
##                       rs903919 rs10910050 rs12082516  rs2645065  rs2840530
## HGDP00607:HGDP00607 -0.3098928 -2.3749872  -2.366543 -2.3113462 -2.4021939
## HGDP00608:HGDP00608 -2.5639212 -2.0231409  -2.608148 -1.3587449 -1.3905425
## HGDP00609:HGDP00609 -2.5874397 -1.3118144  -2.585480 -2.6015539 -1.7877458
## HGDP00610:HGDP00610 -0.8971502 -2.0679398  -2.538792 -1.3015013 -2.3351223
## HGDP00611:HGDP00611  0.0000000 -1.7420280   0.000000  0.0000000  0.0000000
## HGDP00612:HGDP00612  1.2091884  1.1452436   1.137395  1.2135387  1.1508289
## HGDP00613:HGDP00613  0.0000000  0.0000000   0.000000  0.0000000  0.0000000
## HGDP00614:HGDP00614  1.5311409  1.5329331   1.484750  1.5140532  1.5043511
## HGDP00615:HGDP00615  0.4968378 -2.8805369  -2.912886 -2.8508932 -2.1709293
## HGDP00616:HGDP00616 -1.0024447 -0.5993492  -1.491371 -0.9845619 -0.5632986
str(F1@HBDsegments[[1]])

## 'data.frame':    20 obs. of  11 variables:
##  $ id        : Factor w/ 1 level "HGDP00607": 1 1 1 1 1 1 1 1 1 1 ...
##  $ famid     : Factor w/ 1 level "HGDP00607": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pheno     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ start     : num  18788 20590 20600 20626 20649 ...
##  $ end       : num  19127 20595 20605 20647 20654 ...
##  $ size      : num  340 6 6 22 6 7 7 5 7 14 ...
##  $ chromosome: int  5 5 5 5 5 5 5 6 6 6 ...
##  $ start_pos : int  34611504 146164325 146571759 147249909 148176930 148622110 149027979 150577395 150743620 151206457 ...
##  $ end_pos   : int  58142052 146505732 146612787 148149539 148352615 148671072 149227931 150695815 150793567 151383747 ...
##  $ start_dist: num  53.2 148.8 149.3 150.6 151.8 ...
##  $ end_dist  : num  69.4 149.1 149.6 151.6 152.2 ...
str(F1@HFLOD)

## 'data.frame':    63644 obs. of  6 variables:
##  $ chr   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ snps  : Factor w/ 63644 levels "rs1000000","rs10000010",..: 7589 46931 53318 39873 5306 59074 5307 10842 32111 33711 ...
##  $ pos_cM: num  0.509 1.06 1.153 1.791 1.849 ...
##  $ pos_Bp: int  1011521 1495118 1799950 2030796 2201709 2212443 2214736 2241020 2251082 2275590 ...
##  $ HFLOD : num  0.0442 0.0685 0.0801 0.0725 0.1149 ...
##  $ ALPHA : num  0.0205 0.0235 0.0245 0.0239 0.0318 ...
str(F1@bedmatrix)

Hotspots by segments

We implemented a second inner method for the "Hotspots" method. The only paramater that changes is recap.by.segments, it is put to TRUE.

In the default "Hotspots" method the HBD probabilities and FLOD scores are computed for each marker (see above).

With the "Hotspots by segment" method, HBD probabilities and FLOD scores correspond to the mean of HBD probabilities and FLOD score of all the markers of a segment that have been selected in one of the sumaps. There is hence only one value for a segment delimited by hotspots. So we also recommend to set n.consecutive.marker to 1. Since results over multiple markers have already been averaged.

Creation of the segments list

We use the same segment list that is used before (s1).

Creation of the submaps and computation

As said before the only argument that changes is "recap.by.segments", it is put to TRUE. And we recommend to adjust "n.consecutive.marker".

F2 <- makeAtlasByHotspots(x.be, 100, s1)
F2 <- festim(F2)
F2 <- setSummary(F2, recap.by.segments = TRUE, n.consecutive.marker = 1, list.id = "all")

Distance

Creation of the segments list

We will now create segments, which will be used to create the submaps later :

s3 <- segmentsListByDistance(x.be)

## Finding segments for the genome : .......................
## Finding which markers are between two segments: .......................
## Finding mini segments .......................

This function creates segments based on the markers available in the data and the gaps between them. The value of the gap is imposed by the minimun step required between markers. If 2 adjacent markers have an inter-distance larger than the step, then there is a gap between them. The first marker of the pair is the end of a segment and the second marker is the start of a new segment. The function creates an object which will contain three slots :

You can watch a summary of what was done with :

segmentsListSummary(s3)

##    chromosome number_of_segments number_of_markers
## 1           1                  7             49639
## 2           2                  3             53765
## 3           3                  2             44564
## 4           4                  3             39942
## 5           5                  1             40976
## 6           6                  1             43239
## 7           7                  3             35507
## 8           8                  2             37282
## 9           9                  5             31192
## 10         10                  2             34493
## 11         11                  2             32005
## 12         12                  1             31873
## 13         13                  1             25191
## 14         14                  2             21450
## 15         15                  5             19594
## 16         16                  2             19727
## 17         17                  1             16629
## 18         18                  2             20165
## 19         19                  1             10739
## 20         20                  1             16911
## 21         21                  2              9645
## 22         22                  5              9730
## 23         25                  1                15

This function creates a dataframe with three colums :

Creation of the submaps and computation

We will now head toward the creation of submaps using the following commands :

F3 <- makeAtlasByDistance(x.be, 5, s3)
F3 <- festim(F3)
F3 <- setSummary(F3, list.id = "all")

The variable submaps becomes an submapsList object, you can watch the different elements of it with :

str(F3) # careful it can become huge depending on your data sizes

Parallelism with the package

We implemented a paralellism method to make the creation of the submaps more efficient. We paralellized the creation of the submaps, that is to say, the selection of markers. Make sure to have an environment that can support the usage of multiple CPU.

Use the n.cores argument, i.e the number of CPU that will be used to make the differents submaps in the functions Fantasio, makeAtlasByHotspots, makeAtlasByDistance :

F1 <- Fantasio(bedmatrix = x.be, segments = "Hotspots", 
               segment.options = list(hotspots = hotspot_hg18), n = 100, 
               verbose = FALSE, n.cores = 10)


genostats/Fantasio documentation built on Feb. 2, 2023, 5:28 p.m.