This pipeline is designed for filtering mutations found in formalin-fixed and paraffin-embedded (FFPE) samples. The MicroSEC filter utilizes a statistical analysis, and the results for mutations with less than 10 supporting reads are not reliable. Four files are necessary for the analysis: mutation information file, BAM file, and mutation supporting read ID information file.
This tsv or tsv.gz file should contain at least these columns:
Sample Mut_type Chr Pos Ref Alt SimpleRepeat_TRF
Neighborhood_sequence
sample_name 1-snv chr1 108130741 C T N
CTACCTGGAGAATGGGCCCATGTGTCCAGGTAGCAGTAAGC
SimpleRepeat_TRF: Whether the mutation locates at a simple repeat sequence or
not ("Y" or "N").
Neighborhood_sequence: [5'-20 bases] + [Alt sequence] + [3'-20 bases]
Sample, Mut_type, Chr, Pos, Ref, and Alt should be set exactly.
If you do not know the SimpleRepeat_TRF, Mut_type, or Neighborhood_sequence,
enter "-". Automatically detected.
(mandatory, if multiple samples are processed in a batch)
Seven to ten columns are necessary (without column names).
Optional columns can be deleted if they are not applicable.
[sample name] [mutation information tsv file] [BAM file] [read length]
[adapter sequence read 1] [optional: adapter sequence read 2]
[sample type: Human or Mouse] [panel name]
[optional: reference genome fastq file]
[optional: simple repeat region bed file]
PC9 ./source/CCLE.tsv ./source/Cell_line/PC9.bam 127
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT Human TOP
A375 ./source/CCLE.tsv.gz ./source/Cell_line/A375.bam 127
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT Hg38 TOP ./reference/hg38.fa
./reference/simpleRepeat.bed.gz
(optional, but mandatory with cram files)
(optional, but mandatory to detect simple repeat derived artifacts)
This pipeline contains 8 filtering processes.
Filter 1 : Shorter-supporting lengths distribute too unevenly to occur
(1-1 and 1-2).
Filter 1-1: P-values are less than the threshold_p(default: 10^(-6)).
Filter 1-2: The shorter-supporting lengths distributed over less than 75% of
the read length.
Filter 2 : Hairpin-structure induced error detection (2-1 and 2-2).
Filter 2-1: Palindromic sequences exist within 150 bases.
Filter 2-2: >=50% mutation-supporting reads contains a reverse complementary
sequence of the opposite strand consisting >= 15 bases.
Filter 3 : 3’-/5’-supporting lengths are too unevenly distributed to occur
(3-1 and 3-3).
Filter 3-1: P-values are less than the threshold_p(default: 10^(-6)).
Filter 3-2: The distributions of 3’-/5’-supporting lengths are within 75% of
the read length.
Filter 4 : >=15% mutations were called by chimeric reads comprising two
distant regions.
Filter 5 : >=50% mutations were called by soft-clipped reads.
Filter 6 : Mutations locating at simple repeat sequences.
Filter 7 : Mutations locating at a >=15 homopolymer.
Filter 8 : >=10% low quality bases (Quality score <18) in the mutation
supporting reads.
Supporting lengths are adjusted considering small repeat sequences around the mutations.
Rscript MicroSEC.R [working/output directory] [sample information tsv file] [progress bar Y/N]
$ Rscript MicroSEC.R ~ \ ~/source/Sample_list_test.txt N $ Rscript MicroSEC.R ~ \ ~/source/sample_info_test.tsv.gz Y
If you want to know the progress visually, [progress bar Y/N] should be Y.
Results are saved in a tsv file.
github url: https://github.com/MANO-B/MicroSEC
wd <- "~" knitr::opts_chunk$set(collapse = TRUE, fig.width = 12, fig.height = 8, echo = TRUE, warning = FALSE, message = TRUE, comment = "#>") options(rmarkdown.html_vignette.check_title = FALSE, show.error.messages = FALSE, warn = -1) progress_bar <- "N"
library(MicroSEC)
# initialize msec <- NULL homology_search <- NULL mut_depth <- NULL # test data sample_name <- "sample" read_length <- 150 adapter_1 <- "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" adapter_2 <- "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" organism <- "hg38" # load mutation information df_mutation <- fun_load_mutation( system.file("extdata", "mutation_list.tsv", package = "MicroSEC"), "sample", BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38, 24) df_bam <- fun_load_bam( system.file("extdata", "sample.bam", package = "MicroSEC")) # another example data # data(exampleMutation) # data(exampleBam) # df_mutation <- exampleMutation # df_bam <- exampleBam # load genomic sequence ref_genome <- fun_load_genome(organism) chr_no <- fun_load_chr_no(organism) # analysis result <- fun_read_check(df_mutation = df_mutation, df_bam = df_bam, ref_genome = ref_genome, sample_name = sample_name, read_length = read_length, adapter_1 = adapter_1, adapter_2 = adapter_2, short_homology_search_length = 4, min_homology_search = 40, progress_bar = progress_bar) msec_read_checked <- result[[1]] homology_searched <- result[[2]] mut_depth_checked <- result[[3]] # search homologous sequences msec_homology_searched = fun_homology(msec = msec_read_checked, df_distant = homology_searched, min_homology_search = 40, ref_genome = ref_genome, chr_no = chr_no, progress_bar = progress_bar) # statistical analysis msec_summarized <- fun_summary(msec_homology_searched) msec_analyzed <- fun_analysis(msec = msec_summarized, mut_depth = mut_depth_checked, short_homology_search_length = 4, min_homology_search = 40, threshold_p = 10 ^ (-6), threshold_hairpin_ratio = 0.50, threshold_short_length = 0.75, threshold_distant_homology = 0.15, threshold_soft_clip_ratio = 0.50, threshold_low_quality_rate = 0.1, homopolymer_length = 15) # save the results as a tsv.gz file. #fun_save(msec_analyzed, "~/MicroSEC_test.tsv.gz")
msec_analyzed
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.