aRchaic is an R package that performs model based
clustering and visualization of ancient and modern DNA samples (see archaic_plot()
section for example visualization).
This model can be used to identify distinct patterns of DNA damage and
is capable of reflecting relative rates of contamination in contaminated aDNA
samples.
First and foremost, the user is required to install
aRchaic requires R version to be (>= 3.4). If your R version is lower, please upgrade.
And Python packages,
Upon completion of these steps, start a new R session and install aRchaic:
install.packages("remotes") remotes::install_github("kkdey/aRchaic")
Now you should be able to load aRchaic into R.
library(aRchaic)
To ensure reproducible results, we set the seed:
set.seed(1)
For every mismatch in a BAM file with respect to a reference genome, aRchaic records various features of the mismatch - mismatch type, flanking bases, strand break base, strand and position of mismatch from the ends of read - into a .csv file which we call the Mismatch Feature Format (MFF) file.
A typical example MFF file looks like the following
mff <- read.csv(system.file("extdata","ancients","NE2_subsampled.csv",package = "aRchaic"), header = FALSE) head(mff)
The first column represents mismatch type and flanking bases : (left flank base)(mismatch)(right flank base) <-> (A)(C->T)(C). The second and third columns represent the position of mismatch from 5' and 3' ends of reads after mapped to reference. The 4th (5th) columns represent the base 1 base upstream (downstream) from 5' (3') end of read respectively. The 6th column contains information of which strand the read comes from, and the 7th column is an identifier for the read containing the mismatch.
We provide a script to generate the MFF files from the BAM/SAM files here. This linked folder also contains a small example BAM file for testing. The simple two-line code to create the MFF file from this BAM file is as follows
samtools index example.bam python generate_summary_bams.py -b example.bam -f /path/to/reference/FASTA/hs37d5 .fa -o example_mff.csv
The user is expected to organize the MFF files in seperate folders per study. Here, we provide an example with two studies, one with 5 ancient samples derived from Gamba et. al. 2014 and a second study with 5 modern samples derived from the 1000 Genomes Project. These two folders are automatically downloaded and saved under inst/extdata/moderns
and inst/extdata/ancients
when aRchaic is installed.
moderns_dir <- system.file("extdata","moderns", package = "aRchaic") list.files(moderns_dir, pattern = ".csv")
ancients_dir <- system.file("extdata","ancients", package = "aRchaic") list.files(ancients_dir, pattern = ".csv")
Next we prepare the input data to aRchaic model from the above data folders.
out <- archaic_prepare(dirs = c(moderns_dir, ancients_dir))
The output out
should be a list with 2 elements - corresponding to 2 study
folders. Each element is a matrix with samples along the rows, coded
mismatch features along columns, and a cell in the matrix record counts
of the number of times a coded mismatch pattern occurs in a sample.
We provide here the modern MFF files used in Figure 2 of our paper for reproducing the analysis.
aRchaic fits a mixed membership model on the prepared
mismatch feature data from archaic_prepare()
that allows each sample to
have partial memberships in more than one cluster
where the clusters represent distinct mismatch feature profiles.
We first choose a directory - output_dir
- to save the model output and
visualization.
Next we run the archaic_fit
function to fit the mixed membership model.
We choose the number of clusters $K=2$
output_dir <- tempdir() model <- archaic_fit(out, K = 2, output_dir = output_dir)
the output model
consists of
The model output, in particular omega and theta can be visualized using
the archaic_plot()
command.
archaic_plot(model, output_dir = output_dir, background = "modern")
Once the run finishes (a few seconds), go to the output directory output_dir
.
You should see there is a structure.pdf
file that represents the cluster
memberships in omega in a stacked bar chart (see Fig 1 left).
Also, the output_dir
should contain for each $k=1,2$, a logo_clus_k.pdf
file
visualizing the mismatch profile for each cluster as in theta. If
background = "modern"
in archaic_plot
code above, this reresentation will
plot enrichment of the mismatch profiles with respect to a modern background.
The following is a summary of these visualizations
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.