README.md

crispR

These functions were written for a take-home exercise for a Bioinformatics Software Engineer position. They identify candidate guide (protospacer) sequences in a given genomic region, supplied either as a string or as a FASTA file. I never heard back from the company.

My answers are below.

Installation

# install.packages("devtools")
devtools::install_github("c1au6i0/crispR")

Part 1 Answers

a) The code for the function

You can access the code of the function find_proto here.

b) The code to call the function with the example variables (and others, if desired)

library(crispR)
find_proto(d_seq = "TGATCTACTAGAGACTACTAACGGGGATACATAG",
           l = 2,
           PAM = "NGG")
## # A tibble: 3 x 5
##   start_p end_p protospacer PAM   strand
##     <int> <dbl> <chr>       <chr> <chr> 
## 1      20    21 AA          CGG   +     
## 2      21    22 AC          GGG   +     
## 3      22    23 CG          GGG   +

...or using the DNA sequence of the dopamine transporter (DAT, included in the package as internal data):

library(crispR)
print(DAT)
## [1] "TTTGCAAACGCTCGCATGTCACCGAAGGCGCAACAGCTCCGATTTTGAAATTTCCAACACGGCCCTCAAGTTGAAAGTTTTCCAAAAAAATTTAAACACCTCCGCCCATGTGAATGTGAAGTGAAATTCGGGTTTGGCATTCGGCATTGGTTGTGTGGAGCTTTTTTCTAAGTTTTCTGGATATTTTTCAAAAGTCTCAAGGATTATTTAACAATGGATTGGAGCAACTACATGATTTGAGCTTGATATTTATAAGGTAAGTAAGCATATAAACCAGTTTTTAGGTATGCATTCAAATAACTGCGATGGGAATTATAAAATCGAATGAGGAGAATAACTGAATAATTTAAAGCGAGCAACAAATAAACAACATAATATCTTTAAGAACCTAGTTTTCAAGTGCAGCTGGCTTAAAGAGATCAAAAATTAAATATTCATTAGCTGAATCATGTTGGGCATGGTGTTAAAAAATCGCTGTAAAATGCAAAAATTTAAAATGTTAATGATTACAGATGGAAATACTGATAGGTGATAACTTCAAAATGTGTTGTATGTTTGATAAGAAAGCAATAAAGAAAGAATGTAAAATTTAAACACTAATTTAAAGTTGTTAATTTAAACACCACACCAAATTTTTAATATTCTTTAATATTACTGTAATACTTTGTAAACTGGCCAACTAAGATTGCCATCGAATCTGGAACAAAAAGATGGATTCTAGTCATAAATTAACCCAGCGCAGCTCCGAGTACTTTAAGCACTCAATCTGCATTTCGCTTAATTGATCGCAACTCAGGGCAATTAAAGTCAGCGGGGAGGAATTAAGTAGATCAGATTAATTGTTTGACGTGTTTGTGGTTTCATTACACGCAAAGATGCCCGAGAGTTGGGCACATACATATAGACGGATATATGTACATATGTATGTATGTTCTTGTTCTGTGGTGTGGATGGTCAAGTGTTTGCTGCCCAGTGTGTTTGAACATGCCACCTGTCGTATGCGTAATGTGCCAACGAGCTCTTCAAGGGCTGGGTAAGCACATGTCGTTGCCAAACAGGTTTCAAGTGCCCTGGACACACACTTATTGAAGCCCATTGATTGGTGAAGGGTTTTGGATTTTGCGTGGGGTACTGGGATTTAAATCATTGAATATGGTTCTTATTTCTGGCATATCTGCGCAACTGACCCATTTTGAGTGCTGCGATTTTCACAGATTTCAAAGTGCGGCGGCGGGATCTCCACATTTTGGGTCAAGGACACGACGGGTTATATAGTCGCAGATGTCACCAACCGGACATATATCCAAATCAAAGACGCCCACGCCACACGATAACGATAACAATAGCATCAGCGACGAGCGCGAAACATGGAGCGGCAAAGTGGATTTTTTATTATCGGTTATTGGATTCGCCGTCGATTTAGCAAACGTCTGGAGATTTCCCTATTTGTGCTACAAAAATGGCGGTGGTAAGTAGATTCTTTTAGATCCTACACTATTATACAGAAGAAGATCACTATTAAATTGGATACATAAATAAAAAAAAGGTGTCTAAAGGCGCTCTAATGTTATTAGACTGTTTTATGATAATCTTAAAACCATATCATACCTTTAAGATACTTTCTTTATACTTCATTTAAGATACTTTCAGCCAATGTTGGTGATTTTTCTCCGTGTGAACTGGCCAATTAGGGGATAAAGACATTTCGTCAATCGCTCTGGCTGAGTGAATATAATTTATATGGCTTTACACATCATTGGGCTCGCCACGGAACCCGTCGCTTTCCTCTTTCCGCTTTCTGAATTTCCCTTTCGGCCTTGTTAACACTTTGTCAAAGTGGTGCGAAGTGCTGTCGGTGGGGTGGTGCGGGAATGGTTCCTTTGTCAAATTGCAGGGCTTTTCCCCGCTAAAAGAGCGGCAAAGGACACACAGTCGAACACTTACTTCGCTTCCTGATCCTCATCGCCGACGCCATCATTCAAAACAATTCCA
AAAGTTAAGTGCTTCTGTATGACAACTAAATCTGGCAGTCATAATCGCATTTGGCCTGCTTACCCTCTGTCACACTTTTTCACCCCATTTTCTTGGCAATATGTCAGCCTGCTTATGTGGAGTTCTCTAGGCTTTTGTTGCTGGAGCACTTTCCACCATCGCATTGTGGGAAATTCATGTGCGAAACATGTCTCCATGTCTCCACCTCCCCATTTTCCCTCTGATTGCGGCACTTTCCCGCTGATGCGTCGTGAATGGATGTGGCACACAGGGCCTTCGATATAGTTATTACTATTTTATGAGTTTTCTTTTTTTCCTCCTGCTAACCCCCGGTCGCTGCTGCTGCTGCTGCTTTAAAATCATAATAATAAATTGTTATATTTATACGCTTCATTGCGTTCAATCAGTGGTTCCAGCTCCATCTTCAGCTTAAGTTCCATTTTATGTCCGTTGCCAGGGGCAGCAACTCGCTCAGGTTGCGATTATTACCACAGCTTAGTCGGTATTATTTATTCTGTTTCCCATTTTCCTGCTGCCTGCAACTCGGAAGTGGAAGCTTTGGTGGCAGCAACAGGCTGTGCCATCCACATTGTTCTGGATTAGGGCTAAAACTCAAGGAGCTGCCTGGCTTTTTTTCAAACTAAGATCGCGCATACGCACTTGATTGTACTTGGGTTAACTATGTATTTCAATAAATTATATTCTTTATTCTTAAATTTGTCTAAAAGTTTAGATTTCTATTTGGAACTGAAGGACTTAATGGTGAGACTTTGCACCTCAGCTTAAACCAGAATAGTCCTACACAAATACATGTAAAAAGTTATTCACAAGCATTTAAGTAGGCATAACTTTCAATAATTCAATAGAATATCAATTTGCTAGTCAAGCAATAACCAACACTTGGCCGTTTAAAATTGGACAAGTTTTGACCGAATTACGGGATGGGAATCGAATTAGAGGCACATATTTCCACAGTGTCTCCGACGAGCTGCTGGAAAATGGTGGAAATGCACTTGAAGCAACGGAAATTATGGCTTCAACAACAGAGACGAGCGTCATGGATGCCCCCGACTTCAGTTTAGACGGTTTTGTTTTAAATTTCCGACATAATTGACACGACAGAAGGAAGCACCGTAGCTGGCAGACCCTTTGTAGATATACTTTGCACCGAACGCCCTTGCATACAGTTGGGTGTATGTGGGTTTTGGCCCTAGACGACTTTGTCGTCGGCCACTTTGCCTGCTAATAGACATAATCAAATCATAGAACATCATCCCCTGCCACTCGACAAAGTATCGATGGCACTCTAACCAATTGTCTGGCCCAAAAGGGGCGTGTAACTTTGGCTCGGCTGCTTTTCCGGAGGTGACGGAGATGACGGCGATGACTGCAGGAGACAGCCGAGTTGGGTTTTTCCCCGCATATTCCGCCTCTCACCCGCCAGCTTGCTGCTGCTGTTAATGTTTTTAAAATGTCCTTAGAGTGGCACTTGTAAAACAATTTGATTTTGTGTGCCATCGAACATATACATAAAGGTCGAGTGTGTCTATATAGAAACTGTGTGTGTAATTATGTTTGTCGATGAGGGGGCAGCGCGAACAATGCACAGATTGTGACACGGGTCAGTCTCAATATGTGATGGTTATGATGCAATCATTATCATAGAAAACATATTGATTGTTTCTACAGCTCGAAAATAAGTTGGTTAAAAATTACCGGATGTAATGCGGAGAATATCTACGATTGTGATTTGGAATCGTAGCTTCCAGTTCATTGAAATATAATCAAATTTTATTGAATCCGTATGTTCTCTTAGACTAGATTTAATTTAAAAATATGTGAAATTTTAGGCGCTTTTCTTGTGCCCTACGGCATTATGTTGGTGGTCGGTGGCATTCCTCTGTTCTATATGGAATTGGCCCTGGGTCAGCACAATCGTAAGGGTGCCATAACCTGCTGGGGTCGCTTGGTGCCCCTCTTCAAAGGTAATTACTAACTAC
TAACTAATTAATCACAAGAACTTCATGGATAAGACACTTTAACTTACAGGAATTGGATATGCCGTGGTGCTGATAGCCTTCTATGTGGACTTCTATTACAATGTGATTATTGCCTGGTCGTTGCGTTTCTTTTTTGCATCCTTCACCAACTCGCTGCCTTGGACGTCGTGTAATAATATTTGGAACACACCAAATTGTAGACCGGTAACTAGATACATAACCATTTAATATAAATGAATCCTTATTAATTCTAACTTCTTATCAGTTTGAGAGCCAAAATGCATCTCGTGTTCCGGTTATTGGTAACTATAGTGACTTGTATGCCATGGGAAATCAAAGCCTGCTCTACAATGAGACATATATGAATGGTTCGAGCTTGGATACGTCAGCGGTTGGCCATGTGGAGGGTTTTCAGTCCGCAGCATCGGAATATTTTAAGTGAGTTTGCTGAAATATTCACTTTATAATTGTAGTTAAATTTAATTTCTCGTATAGCCGCTACATTTTGGAGCTGAACCGCAGCGAGGGAATCCATGACTTGGGCGCCATCAAATGGGACATGGCGCTGTGCCTTCTGATTGTCTACTTAATTTGTTACTTCTCCCTGTGGAAGGGTATCAGCACTTCGGGCAAGGTTGTGTGGTTTACTGCCCTCTTTCCCTATGCAGTGCTACTGATTCTTCTAATCAGAGGGCTCACACTGCCGGGATCTTTTCTGGGCATTCAGTATTATCTTACGCCCAACTTTAGTGCCATCTATAAGGCTGAGGTCTGGGTGGATGCTGCCACCCAGGTGTTTTTCTCATTGGGTCCAGGATTTGGAGTGCTGCTGGCCTATGCATCCTATAATAAATACCATAACAATGTATACAAGTAGGTTGCAAGTCTTATACTTCAATAATTCCTCACTTAAATTATTACTTAACTTGATTACAGGGATGCTTTGTTAACCAGTTTCATTAACTCGGCTACCAGCTTTATAGCCGGCTTTGTGATATTCTCCGTGCTGGGTTACATGGCCCACACACTGGGTGTAAGAATTGAAGATGTTGCCACCGAAGGACCTGGTCTGGTTTTCGTGGTCTATCCAGCTGCCATTGCCACCATGCCGGCCAGCACTTTCTGGGCTCTAATATTCTTCATGATGCTGCTAACTTTGGGCTTGGATAGTTCGGTAGATATAACCTATACTCTATATCGACCTTAAACTAATTGTACAACTCTTACAGTTTGGTGGTTCAGAGGCTATAATCACAGCTTTGAGCGACGAGTTTCCCAAGATCAAAAGAAACCGAGAGCTGTTTGTAGCTGGACTGTTTTCCCTGTACTTCGTGGTCGGTTTGGCCAGTTGCACTCAGGGTGGCTTCTATTTCTTCCATCTGCTGGATCGTTACGCTGCTGGCTACTCGATTTTGGTGGCCGTGTTCTTTGAGGCAATCGCCGTGTCCTGGATCTACGGAACCAATCGATTTAGCGAGGATATACGGGACATGATTGGTTTTCCACCGGGAAGATACTGGCAGGTGTGTTGGCGATTTGTGGCACCAATTTTCCTGCTCTTCATCACGGTTTACGGGCTGATTGGCTACGAGCCACTGACATATGCGGACTATGTGTATCCCAGTTGGGCCAATGCGCTGGGTTGGTGCATAGCTGGTTCCTCGGTTGTGATGATTCCTGCCGTGGCGATATTTAAACTACTTTCCACGCCGGGAAGTCTGCGTCAGCGGTTCACAATTTTGACCACACCATGGCGAGATCAGCAATCGATGGCAATGGTGCTGAACGGGGTCACCACCGAGGTCACCGTGGTGCGATTAACCGACACGGAGACCGCCAAGGAACCCGTCGATGTCTGAGTTCGACCAGTGGCCCGTTTTCAAATTTACTACGTTTAGATTTGGAAATTTACCAACAACCGATGTTCACGTATGTAGATTGTGGCTTTGCAGGAGAGTTTGTGTTTTTGGTTCGACATTTGGCGCAGATCCGCAGAAGGA
TCGGCAGCAATTTCGCAAACAACGTACTTAGGTTTCGGCACCAAAGAAAAAAAAGAAGAAAACCCACAAGCAAACGCAGCTTCAAACCATAAGTTCTAAGTTAAGATTAGCTTGTAGTTCGTATGGTTAATGCCCAGTTATAGCATACTCATATATATACGATGCTGTGTAGATAATATCAAGTCCGAAAGTGCGAAACACTTGTCAGTTAATCTATCGAGAACTCCTTGAAAGAATGTTTGCATATGCCAATGGAATTAACGACACGATCGAGGCTCAAAATTATTGGGCACATAATCTGTACATACACAAGTCGATGATGTAAAAGCTTTAAAACACGTTATAGGATATCTCATGGAACTGAAGCAAAAGCTGCCAGAGTTGAACTACAAAATCCGCTCAAACAGCCATGAGGGGATTTTGTCATATGCAAAGGTTTGTTCTAGACTAATTAACCCCGTCCAAAAAAAAAAAACTACGATTGTTTGCTTAAAGCAGCCATTTTAAGCAAACAATTGAGAGTTATCGAAATAATACATTCCCTTTTATAAACTCATTTCTGTACATAGATGTATGATATTTTAGCATTGTTTTAAGAGTTTCTTTCACGCTGAAGCGGAACTAACAGCCATGAGTTCTGTTTCAATTTGTTGAAATAAATTATTATACTTTGAGTTACT"
library(crispR)
find_proto(d_seq = DAT, 
           l = 20, 
           PAM = "NGG")
## # A tibble: 353 x 5
##    start_p end_p protospacer          PAM   strand
##      <int> <dbl> <chr>                <chr> <chr> 
##  1       6    25 AAACGCTCGCATGTCACCGA AGG   +     
##  2      40    59 CGATTTTGAAATTTCCAACA CGG   +     
##  3     109   128 TGTGAATGTGAAGTGAAATT CGG   +     
##  4     110   129 GTGAATGTGAAGTGAAATTC GGG   +     
##  5     115   134 TGTGAAGTGAAATTCGGGTT TGG   +     
##  6     122   141 TGAAATTCGGGTTTGGCATT CGG   +     
##  7     128   147 TCGGGTTTGGCATTCGGCAT TGG   +     
##  8     136   155 GGCATTCGGCATTGGTTGTG TGG   +     
##  9     158   177 GAGCTTTTTTCTAAGTTTTC TGG   +     
## 10     180   199 GATATTTTTCAAAAGTCTCA AGG   +     
## # … with 343 more rows
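
For readers who want a feel for the approach without opening the package source, here is a minimal base-R sketch. It is not the crispR implementation: find_proto_sketch is a hypothetical name, it only scans the (+) strand, and it assumes that N is the only wildcard used in the PAM.

```r
# Hypothetical helper (not part of crispR): report every protospacer of
# length `l` that sits immediately 5' of a PAM on the (+) strand.
find_proto_sketch <- function(d_seq, l = 20, pam = "NGG") {
  d_seq <- toupper(d_seq)
  l <- as.integer(l)

  # Assumption: N is the only IUPAC wildcard; turn it into a regex class.
  pam_regex <- gsub("N", "[ACGT]", toupper(pam), fixed = TRUE)

  # A zero-width lookahead reports overlapping PAM sites as well.
  hits <- gregexpr(paste0("(?=", pam_regex, ")"), d_seq, perl = TRUE)[[1]]
  pam_start <- as.integer(hits)

  # Keep PAMs with room for a full protospacer upstream
  # (this also drops the -1 that gregexpr returns when nothing matches).
  pam_start <- pam_start[pam_start > l]

  start_p <- pam_start - l    # 1-indexed, inclusive
  end_p <- pam_start - 1L     # 1-indexed, inclusive

  data.frame(
    start_p = start_p,
    end_p = end_p,
    protospacer = substring(d_seq, start_p, end_p),
    PAM = substring(d_seq, pam_start, pam_start + nchar(pam) - 1L),
    strand = rep("+", length(start_p)),
    stringsAsFactors = FALSE
  )
}

res <- find_proto_sketch("TGATCTACTAGAGACTACTAACGGGGATACATAG", l = 2, pam = "NGG")
print(res)
```

On the short example sequence this sketch returns the same three rows as the find_proto call shown earlier (AA, AC and CG at positions 20-21, 21-22 and 22-23).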

c) The time complexity for the function (in big-O notation)

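The function contains no explicit loop, but it still visits every nucleotide of the sequence: the regular-expression matching (done with stringr) scans the input from start to end. For a fixed PAM and protospacer length, the running time therefore grows linearly with the sequence length n.

Time complexity: O(n)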
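
To make the linear scan visible, the regular-expression search can be written out as an explicit loop. This is only an illustrative sketch under stated assumptions: count_ngg_sites is a hypothetical function, the PAM is hard-coded to NGG, and only the (+) strand is considered. Each position is inspected once with a constant amount of work, which is where the O(n) bound comes from.

```r
# Hypothetical explicit-loop equivalent of a regex scan for an NGG PAM.
count_ngg_sites <- function(d_seq, l = 20) {
  bases <- strsplit(toupper(d_seq), "")[[1]]
  n <- length(bases)
  hits <- 0L

  # i is the position of the N in NGG; a protospacer of length l
  # must fit before it, and the two G's must fit after it.
  if (n >= l + 3) {
    for (i in (l + 1):(n - 2)) {
      # Constant work per position: three single-character comparisons.
      if (bases[i] %in% c("A", "C", "G", "T") &&
          bases[i + 1] == "G" && bases[i + 2] == "G") {
        hits <- hits + 1L
      }
    }
  }
  hits
}

n_sites <- count_ngg_sites("TGATCTACTAGAGACTACTAACGGGGATACATAG", l = 2)
print(n_sites)  # 3
```

Building the result table adds work proportional to the number of hits times the protospacer length, which is still linear in n when l is fixed.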

Part 2 Answers

a) The code for the function

You can access the code of the function find_FASTA here.

b) The source of the FASTA file used for the reference genome in the example problem

I downloaded the Reference Genome Sequence GRCh38 from here.

c) How many candidate guide (protospacer) sequences were identified in the example problem

A total of 54 protospacers were identified on the (+) strand. Please note that the arguments “start”, “end” and “l” are 1-indexed and that intervals are fully closed.
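
The 1-indexed, fully closed convention can be illustrated in a few lines of base R, reusing the short example sequence from Part 1. The to_bed helper is hypothetical and is included only to show how such coordinates would map to a 0-based, half-open format such as BED.

```r
region <- "TGATCTACTAGAGACTACTAACGGGGATACATAG"
start_p <- 20
end_p <- 21

# Both endpoints are included, so the width is end - start + 1.
protospacer <- substring(region, start_p, end_p)
width <- end_p - start_p + 1
print(protospacer)  # "AA"
print(width)        # 2

# Hypothetical helper: convert to 0-based, half-open coordinates
# by shifting the start down by one and leaving the end unchanged.
to_bed <- function(start_p, end_p) c(start = start_p - 1, end = end_p)
bed <- to_bed(start_p, end_p)
print(bed)
```

R's substring() already follows the same convention, so coordinates reported by the functions can be passed to it directly.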

d) The list of candidate guide (protospacer) sequences in a tab-delimited file…

A tab-delimited file can be downloaded here.
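
A table of results can be exported to a tab-delimited file with base R. The snippet below is a sketch, not the code used to produce the downloadable file: the three rows are copied from the Part 1 example output, and the output path is a temporary file chosen for illustration.

```r
# Rows copied from the Part 1 example output.
guides <- data.frame(
  start_p = c(20L, 21L, 22L),
  end_p = c(21L, 22L, 23L),
  protospacer = c("AA", "AC", "CG"),
  PAM = c("CGG", "GGG", "GGG"),
  strand = "+",
  stringsAsFactors = FALSE
)

# Illustrative path; replace with a real file name as needed.
out_file <- tempfile(fileext = ".tsv")

# Tab separator, no quoting, no row names: one header line plus one line per guide.
write.table(guides, file = out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back to check the round trip.
round_trip <- read.delim(
  out_file,
  colClasses = c("integer", "integer", "character", "character", "character")
)
print(round_trip)
```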

Dependencies

All dependencies are listed in the DESCRIPTION file on my GitHub account here.



c1au6i0/crispR documentation built on Feb. 27, 2020, 12:42 a.m.