# seqCAT: The High Throughput Sequencing Cell Authentication Toolkit In seqCAT: High Throughput Sequencing Cell Authentication Toolkit

hct116_rko$impact == "MODERATE", ] nrow(hct116_rko_hm)  ## Evaluation of specific chromosomes, regions, genes and transcripts You might be interested in a specific chromosome or a region on a chromosome, and it might be useful to work with data for only that subset. This operation is easily performed on a comparison dataframe: hct116_rko_region <- hct116_rko[hct116_rko$chr == 12 &
hct116_rko$pos >= 25000000 & hct116_rko$pos <= 30000000, ]


You might also be interested in a specific gene or transcript, of special importance to your study:

hct116_rko_eps8_t <- hct116_rko[hct116_rko$ENSTID == "ENST00000281172", ] hct116_rko_vamp1 <- hct116_rko[hct116_rko$ENSGID == "ENSG00000139190", ]
hct116_rko_ldhb <- hct116_rko[hct116_rko$gene == "LDHB", ] head(hct116_rko_ldhb)  Here we see two mutations in the LDHB gene, one mismatching MODIFIER variant and one matching LOW variant. This is a good approach to check for known mutations in your dataset. For example, the HCT116 cell line is supposed to have a KRASG13D mutation. We might look for this using its known rsID or position: hct116_rko_kras <- hct116_rko[hct116_rko$rsID == "rs112445441", ]
hct116_rko_kras <- hct116_rko[hct116_rko$chr == 12 & hct116_rko$pos == 25398281, ]
nrow(hct116_rko_kras)


The reason that we don't find this particular variant in the HCT116 vs. RKO comparison is that it is not present in the RKO profile, either because it isn't a mutation in RKO or because there was no confident variant call for that particular position. The compare_profiles function only looks at overlapping positions by default, so we will have to look at the individual profiles instead. seqCAT has two functions to help with this: list_variants and plot_variant_list:

The list_variants function looks for the genotypes of each specified variant in each provided SNV profile. First, let's create a small set of interesting variants we want to look closer at:

known_variants <- data.frame(chr  = c(12, 12, 12, 12),
pos  = c(25358650, 21788465, 21797029, 25398281),
gene = c("LYRM5", "LDHB", "LDHB", "KRAS"),
stringsAsFactors = FALSE)
known_variants


The minimum information needed are the chr and pos columns; any additional columns (such as gene, here) will just be passed along for later use. We can now pass this set (along with our SNV profiles) to the list_variants function:

variant_list <- list_variants(list(hct116, rko), known_variants)
variant_list


While this gives you a nice little list of the genotypes of your specified variants, we can also visualise this using the plot_variant_list function. It takes a slightly modified version of the output from the list_variants function: it may only contain the genotype columns. We thus need to create row names to identify the variants, like this:

# Set row names to "chr: pos (gene)"
row.names(variant_list) <- paste0(variant_list$chr, ":", variant_list$pos,
" (", variant_list\$gene, ")")

# Remove "chr", "pos" and "gene" columns
to_remove <- c("chr", "pos", "gene")
variant_list <- variant_list[, !names(variant_list) %in% to_remove]

# Plot the genotypes in a grid
genotype_grid <- plot_variant_list(variant_list)
genotype_grid


This gives us an easily overviewed image of what variants are present in which samples, and their precise genotype. We can see that the KRASG13D mutation is indeed present in the HCT116, but not in RKO. We can also see that RKO has a homozygous G/G genotype for one of the LDHB variants, while HCT116 is heterozygous (T/G) for the same. (Please note that this data was aligned and analysed using the GRCh37 / hg19 assembly and that listed positions might not be accurate for other assemblies.)

# Evaluating multiple comparisons

Many scientific studies compare more than just two datasets, not to mention meta-studies and large-scale comparisons. It is therefore important to be able to characterise and evaluate many-to-one or many-to-many cases as well - the seqCAT package provides a number of functions and procedures for doing so.

## Performing multiple profile comparisons

The first step of such an analysis is to create and read SNV profiles for each sample that is to be evaluated (please see [section 2][Creation of SNV profiles]). The example data used here has three different samples: HCT116, HKE3 and RKO. The compare_many function is a helper function for creating either one-to-many or many-to-many SNV profile comparisons, and returns a list of the global similarities for all combinations of profiles and their respective data (for downstream analyses):

# Create list of SNV profiles
profiles <- list(hct116, hke3, rko)

# Perform many-to-many comparisons
many <- compare_many(profiles)
many[[1]]


We can here see the summary statistics of all three combinations of the cell lines in the example data. Notice that compare_many will only perform a comparison that has not already been performed, i.e. it will not perform the RKO vs. HCT116 comparison if it has already performed HCT116 vs. RKO. Also notice that it does perform self-comparisons (i.e. HCT116 vs. HCT116), which is useful for downstream visualisations.

The similarities are stored in the first element of the results (many[[1]]), while the data for each comparison is stored in the second (many[[2]]). The second element is itself also a list, whose indices correspond to the row names of the similarity object. If we, for example, are interested in the HKE3 self-comparison, we can see that its row name is 4. We can then access its data like this:

hke3_hke3 <- many[[2]][[4]]


You may also specify the a and b similarity score parameters, as above. If you are interested in only a one-to-many comparison (for cases when you have a "true" baseline profile to compare against), you can do this by also specifying the one = <profile> parameter in the function call. This is useful if you have a COSMIC profile to compare against, for example:

many_cosmic <- compare_many(profiles, one = cosmic)
many_cosmic[[1]]


It is important to note that performing many comparisons like this may take quite some time, depending on the number of profiles and how much data each profile has. By returning all the data in a list you may then save each comparison to a file, for later re-analysis without having to re-do the comparisons.

## Visualising multiple comparisons

A useful and straightforward way of visualising multiple profile comparisons is to use a heatmap. We can use the summary statistics listed in the similarity object from above as input to the function plot_heatmap, which gives you a simple overview of all your comparisons:

heatmap <- plot_heatmap(many[[1]])
heatmap


Here we see a blue colour gradient for the similarity score of the three cell lines, which are clustered according to their similarity (using cluster = TRUE, as default). You may change the size of the text annotations using annotation_size = 5 (default) or suppress them entirely (annotate = FALSE). You may also suppress the legend (legend = FALSE), change the main colour of the gradient (colour = "#1954A6" by default) or change the limits of the gradient (limits = c(0, 50, 90, 100) by default). The choice of gradient limits are based on clarity (comparisons with a similarity score less than 50, i.e. those that likely have too few overlapping variants to begin with, are suppressed) and the previously mentioned 90 % concordance threshold [@Yu2015].

This heatmap makes it clear that HCT116 and HKE3 are, indeed, very similar to each other, while RKO differs from them both. These types of heatmaps can be created for an arbitrary number of samples, which will then give a great overview of the global similarities of all the samples studied. This can be used to evaluate the quality of the datasets (e.g. to see which comparisons have very few overlaps), find similarity clusters and potential unexpected outliers. If a sample stands out in a heatmap such as this, that is grounds for further investigation, using both the methods described above and more classical evaluations of sequencing data (read quality, adapter contamination, alignments, variant calling, and so on).

file.remove("hct116.profile.txt")


# Citation {-}

If you are using seqCAT to analyse your data, please cite the following article:

seqCAT: a Bioconductor R-package for variant analysis of high throughput sequencing data
Fasterius E. and Al-Khalili Szigyarto C.
F1000Research (2018), 7:1466
https://f1000research.com/articles/7-1466

# Session info {-}

sessionInfo()


# References

## Try the seqCAT package in your browser

Any scripts or data that you put into this service are public.

seqCAT documentation built on Nov. 8, 2020, 7:36 p.m.