# rmarkdown::render("vignettes/vignette.Rmd")
library(MotifFinder)
knitr::opts_chunk$set(fig.width=10)

Simulate Data

We create 300 DNA sequences of length 200 with the motif "ATgTT_GtCC" around the center of 50% of the sequences.

library(MotifFinder)
set.seed(42)
simulated_sequences <- simulate_sequences(motif="ATgTT_GtCC")

str(simulated_sequences)

Run MotifFinder

We run MotifFinder with a length slightly shorter than the known motif length.

motif_found <- findamotif(simulated_sequences$seqs, len=7, stranded_prior = T, seed = 42)

Plot the Motif(s) Found

We can see that we have recovered the motif.

library(ggseqlogo)
ggseqlogo(get_PWM(motif_found))
ggseqlogo(get_PWM(motif_found, complement=TRUE))

Where is the motif

We can also check the location of the motifs found as well as the location it thinks the motifs are.

plot_motif_location(motif_found)
plot(motif_found$prior)

Check Convergence

It's a good idea to plot the inferred probability the motif being in each sequence (alpha parameter) for each iteration to check convergence.

plot(motif_found$alphas, ylim=c(0,1))

Other helper functions include downloading PWMs from the Jaspar or Hocomoco databases.

Alx1 <- download_PWM("ALX1_MOUSE.H11MO.0.B")
str(Alx1)

Extracting DNA from the genome

To use real DNA instead of simulating sequences you will need to download the genome and extract regions that are specified using a BED file.

```{bash, getFASTA, eval=FALSE} wget ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz gunzip motifs/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz samtools faidx motifs/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa

bedtools getfasta -s -fi Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -bed example_locations.bed -name > example_locations.fasta

We can then load these sequences using the helper function load_sequences

```r
genomic_sequences <- load_sequences("example_locations.fasta")


MyersGroup/MotifFinder documentation built on June 7, 2019, 3:42 p.m.