options(width = 100) knitr::opts_chunk$set(echo = TRUE, dev = c('png','pdf'), fig.width= 6, fig.height=4, collapse = TRUE, comment = "#>" )
Rscript <- readLines(file.path("..","inst","extscript","align_sequences.R")) begin <- which(Rscript == "#- Begin blast -#") + 1 end <- which(Rscript == "#- End blast -#") - 1 Blast_Rcsript <- Rscript[begin:end] begin <- which(Rscript == "#- Begin get_closest_seq -#") + 1 end <- which(Rscript == "#- End get_closest_seq -#") - 1 get_closest_seq_Rcsript <- Rscript[begin:end] begin <- which(Rscript == "#- Begin align -#") + 1 end <- which(Rscript == "#- End align -#") - 1 align_Rcsript <- Rscript[begin:end] begin <- which(Rscript == "#- Begin postprocess_align_gaps -#") + 1 end <- which(Rscript == "#- End postprocess_align_gaps -#") - 1 postprocess_align_gaps_Rcsript <- Rscript[begin:end] begin <- which(Rscript == "#- Begin postprocess_align_identical -#") + 1 end <- which(Rscript == "#- End postprocess_align_identical -#") - 1 postprocess_align_identical_Rcsript <- Rscript[begin:end] begin <- which(Rscript == "#- Begin postprocess_align_DRM -#") + 1 end <- which(Rscript == "#- End postprocess_align_DRM -#") - 1 postprocess_align_DRM_Rcsript <- Rscript[begin:end] Rscript <- readLines(file.path("..","inst","extscript","build_trees.R")) begin <- which(Rscript == "#- Begin build_tree -#") + 1 end <- which(Rscript == "#- End build_tree -#") - 1 build_tree_Rcsript <- Rscript[begin:end] begin <- which(Rscript == "#- Begin postprocess_tree -#") + 1 end <- which(Rscript == "#- End postprocess_tree -#") - 1 postprocess_tree_Rcsript <- Rscript[begin:end] Rscript <- readLines(file.path("..","inst","extscript","Ancestral_state_to_clades.R")) begin <- which(Rscript == "#- Begin relabelling -#") + 1 end <- which(Rscript == "#- End relabelling -#") - 1 relabelling_Rcsript <- Rscript[begin:end] begin <- which(Rscript == "#- Begin phyloscanner -#") + 1 end <- which(Rscript == "#- End phyloscanner -#") - 1 phyloscanner_Rcsript <- Rscript[begin:end] begin <- which(Rscript == "#- Begin postprocess_phyloscanner -#") + 1 end <- which(Rscript == "#- End postprocess_phyloscanner -#") - 1 postprocess_phyloscanner_Rcsript <- Rscript[begin:end] begin <- which(Rscript == "#- Begin extract_clades -#") + 1 end <- which(Rscript == "#- End extract_clades -#") - 1 extract_clades_Rcsript <- Rscript[begin:end]
In this document, we describe the procedure step by step to construct the transmisson chains (or transmission clusters) from the viral genomic sequences. The code showed in this document shall not be run as it stands, but the scritps to run are introduced with some explanatory comments.
In the following, we present the procedure that we used to study data from Public Health Seattle and King County (PHSKC) to analyse the HIV epidemic in King County. Then, we sometimes refer to the PHSCK data and the PHSCK context because the scripts are made to be applied in this framework.
Below, the following sections address:
phyloHIV
package
For what follows, it is assumed that you are available:
Moreover, for the codes described in this document, it is required to set up an option
file, see the example in the package folder extjson/
.
This folder contains two json files:
options.json
: is about input/output paths/names and some required variables
(see below), andinternal_options.json
: is about required values that you should not change
sat first. A detailed description is given with the vignette phyloHIV_options
.
Nevertheless, you can generate your option json file by using the write_option_file
function and/or by contemplating the example option json file:
In order to build phylogenies, we need 1) viral genomic sequences of infected persons in the population of interest and 2) sequences from external persons who are likely to have infected persons in the population of interest.
For the first point "1)", we only have sequences of a part of the infected persons in King County (and another document shows how we take this into account while modelling phylogenetic subtrees). For the second point "2)", at first we do not have any evidences about who are likely to have infected a person living at King County. We then rely on an online genetic database in order to get the sequences from people in the world for which are close to the available sequences from PHSCK (see below for the detailed steps).
Thus first, we download a bunch of sequences from the Los Alamos National Lab website (LANL, https://www.hiv.lanl.gov/components/sequence/HIV/search/search.html). For each subtype, downloads are limited to sequences that includes the PR/RT regions. In particular, we specify the region with the start (2253) and the end (3253), expect for subtype B as we explain in the following. We also ask information about persons related to the sequences, as sexe, accession number, geographic region and infection year.
Before dealing with these sequences, we preprocess them by removing duplicates and
the gaps (for the illustration of this package, we also reduce the number of
downloaded sequences).
From this bunch of (background) sequences, we expect that only a (small) part is
related to the SKC epidemic.
In order to determine the relevant sequences, we use the blast software
(version 2.7.1+, https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.7.1/)
by
1) making a blast database for each subtype (with the makeblastdb
function) and by
2) determining the most similar LANL sequences from the SKC sequences (with the blastn
function).
This code have created for each subtype the following files in path_LANL_seq
directory (path defined in extjson/options.json
):
.db.nin
: result of makeblastdb
,.db.nhr
: result of makeblastdb
,.db.nsq
: result of makeblastdb
and..._closest.txt
: result of blastn
.We finally obtain the closest sequences with the get_closest_seq
function which
also removes the duplicates (if a background sequence is one of the closest for two or
more SKC sequences).
Moreover, this function checks if the percent of background sequences in the final
file (closest background sequences + SKC sequences) is not lower than 70% (or more).
If the percent is lower, it means that the LANL database is too small and/or
that some closest sequences are the same for a lot of SKC sequences.
If we keep going with a such final sequences file, we should have difficulty to
find links between the background epidemic and the SKC epidemic (i.e. some introductions)
and then we would obtain bigger transmission clusters than we should actually
estimate.
Thus, if you have a percent lower than 70%, the get_closest_seq
function starts
again with a larger number of closest background sequences for each SKC sequences.
This code have created the following files in the
path_aligned
directory (path defined in extjson/options.json
):
..._match.txt
: a file about the closest LANL sequences,..._closest.fasta
: the closest LANL sequences, newseq_LANL_Subtype_XXX.fasta
, andClosest_result.rda
: an R object, the result of closest_result
that contains
for each subtype closest_result$prop
, the proportion of LANL sequences in newseq_LANL_Subtype_XXX.fasta
, and closest_result$nbre_close
, the number of
closest LANL sequences for each fake sequence.When we ran the script, we did not obtain a large enough percent for the subtype B.
The last iteration of the get_closest_seq
function (which selects the
80 closest background sequences) provided a percent of background sequences about 40%.
Thus, we have aimed to download more B sequences from the LANL website and
we have used less restrictive criteria. Then, we have downloaded the B sequences
for which the region (2253 - 2953) is available.
Then, we obtain a percent around 70-75% which is sufficient to work.
As we then have closest sequences from the SKC sequences for each subtype, the last step before building the phylogenies is to align (and to trim) the sequences.
At this stage, we then have sequences files for each subtype which contains the SKC sequences and a majority of background sequences. In order to efficiently estimate the phylogeny root during a next step, we add to each sequences file, an outgroup from another subtype. Since the genetic distance between two different subtypes is (in general) larger than the genetic distance between two sequences of a same subtype, the root of the tree should be in the additional subtype group. Nevertheless, if we add sequences from a very different subtypes, the estimation of the phylogeny could be complicated, as aforementioned. Then, for each subtype we add few sequences from the closest subtype, according to Li et al. (2015). Therefore, we should obtain for each subtype a fasta file containing (in this order), the closest LANL sequences, the PHSKC sequences and the outgroup.
With the following code, we add the outgroup at each file, as aforementioned, and
we align the sequences by using
MAFFT
(version 7.4). It is also possible to change an option of the function align_seq
to use the CodonAlign software.
This code have created for each subtype the following files in the
path_aligned
directory (path defined in extjson/options.json
):
..._mafft_aligned.fasta
: the aligned sequences by using MAFFT andoutgroup.rda
: an R object that contains the sequence names of
the outgroups.If the align_seq
function is run with MAFFT and if it contains more than 500
sequences, the algorithm splits the sequence files into small files
(with fasta-splitter
).
In order to align all sequences, the algorithm will
1) align a first small file and
2) align the $(i+1)$-th file with respect to the $i$ first aligned small files
which are merged into one file.
If the align_seq
function is run with CodonAlign, the algorithm calls a python
script. I have slightly changed the script in order to specify the name and the
directory of the output.
However, the script is quite long to run, maybe because it handles the whole file.
Indeed, I did not find the way to use it as MAFFT, with a piecewise running.
Before keep going, we need to postprocess the alignment to fix some errors or to remove patterns/sequences that could badly impact the phylogeny estimation.
Firstly, we remove the columns which mostly contains gaps. This kind of columns could be because of an error of the alignment algorithm which for few sequences at the same loci
For doing this, we use the seq.strip.gap
function from the big.phylo
package
(see the GitHub link).
In pratice, we decide to remove gap columns if the columns contains less than 25
non-gaps.
For subtypes with a large number of sequences, we have use a threshold equals to
100 instead of 25 (see Remove more columns
in the).
This code overwrites the ..._mafft_aligned.fasta
sequences in path_aligned
.
Nevertheless, we have observed that it can still remain some bugs in the alignment after this corrections, especially for the biggest subtype (C and B). We then need to manually fix some alignments bugs with a genome browser (see Geneious).
Below, I show some examples of fixes.
First case, before: {#id .class height=150px}
First case, after: {#id .class height=100px}
Second case, before: {#id .class height=250px}
Second case, after: {#id .class height=250px}
Third case, before: {#id .class height=250px}
Third case, after:
{#id .class height=250px}
As we have aligned sequences, it is now possible to detect duplicates if some
sequences are identical with the find_identical_sequences
function.
Nevertheless, there is a non null probability that two infected persons have the
same viral genome, especially if one have just infected the other one.
However, even if it the case, we choose to remove identical sequences (just keeping one),
since we do not have evidences and because it could badly impact the
phylogeny estimation.
This code overwrites the ..._mafft_aligned.fasta
sequences in path_aligned
and save the removed sequence indexes in indexes_identical.rda
.
Below, we also show how to remove drug resistance mutations that could not be
relevant to distinguish infected persons in a genomic point of view.
We use code from the big.phylo
package, in particular the seq.rm.drugresistance2
function that I have slightly changed so that it runs on my laptop.
This code have created for each subtype the following files in the
path_aligned
directory (path defined in extjson/options.json
):
..._mafft_aligned_ndrm.fasta
:
building_tree
and we can choose between different algorithm
to estimate the phylogeny (fasttree
, ExaML
or maximum_parsimony
).
We usually choose fasttree since we had difficulty to install EXaML and because
the maximum_parsimony
function provides NaN
values and then crashed without
any possibility to obtain a result.
This code have created some files in path_tree
.
For each subtype, a folder is created which contains the following outputs:
..._mafft_aligned_ndrm.newick
: the estimated tree, ..._mafft_aligned_ndrm_ft.XXX.newick
: the bootstrap estimated trees andRAxML...
: some files about the estimation.The option bs.n
is the number of bootstrap tree to compute.
At first, we choosed to compute only few of them in order to quickly obtain the
results.
We recently increase this number to 500 (and also 1000) in order to obtain more
information about the phylogeny estimate uncertainty.
Finally, we run the following function to remove the external group (additional sequences from another subtype and the HXB2 sequence). Moreover, the tree root is defined as the position of the external group.
This code have created the following files in path_tree/.../
.
..._rerooted.newick
: the same tree which was rerooted and for which
the outgroup was removed (same for the boostrap trees).
From the phylogenetic trees, we can not directly determine clusters of transmission occuring in SKC since we only know the location of the tips. In this Section, we present how to estimate the ancestral state (i.e. the location of each node in this case) with the Sankoff algorithm.
The first step is to relabel the tips according to their groups.
For instance, the persons infected in SKC are in the SKC
group, the
persons registered in the PHSCK database which are not a kc_case
are in the
WA\KC
group and the tips related to background sequences are in different groups
according to their world region origins (Europe
, Asia
, ...).
Below, we give a sketch of the script to use, which is a part of the
use_phyloscanneR
script.
This code have created the following files in path_tree/.../
.
..._renamed.newick
: the same tree which was rerooted and for which
the outgroup was removed (same for the boostrap trees).If you are interested about a specific group (as for example the heterosexual persons, HSX
)
you can define a subgroups:
KCHSX
)KCnonHSX
)It is then required to add the next lines to the extjson/options.json
file.
"focus_subgroup": ["HSX_F","HSX_M"], # the categories to focus "index_subgroup": 7, # in which spot is this information in the tip labels "name_subgroup": "HSX" # the name of the subgroup that will be use to relabel the tips
We are now able to use a phyloscanner
script to estimate the ancestral states.
The following chunk could be long to run and it could be necessary to run it
on HPC.
And a postprocess step:
This code have created the following files in path_colored_tree/.../
:
..._trait.csv
and ..._MRCA.txt
: useful files for the next step.
We show below how to extract SKC transmission clusters from the phylogeny and the
ancestral state reconstruction.
In other words, it corresponds to extracting subtrees according to the coloring.
In the following, we give a sketch of the Clades_extracting
R script to
extract clusters of one subtype.
Nevertheless, we usually gather the clusters of all the subtypes in a same list.
This code have created for each bootstrap an RData file, Clades_save_XXX.RData
in path_clades
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.