```{r, echo=FALSE}
if (!reticulate::py_module_available("tensorflow")) {
  knitr::opts_chunk$set(eval = FALSE)
} else {
  knitr::opts_chunk$set(eval = TRUE)
}
library(deepG)
library(magrittr)
library(keras)
options(rmarkdown.html_vignette.check_title = FALSE)
```
```{css, echo=FALSE}
mark.in { background-color: CornflowerBlue; }
mark.out { background-color: IndianRed; }
```
## Introduction

The most common use case for the deepG data generator is to extract samples from a collection of fasta (or fastq) files. The generator will always return a list of length 2. The first element is the input $X$ and the second the target $Y$. We can differentiate between 2 approaches:

+ **Language model**: Part of a sequence is the input and another part the target.
    + Example: Predict the next nucleotide given the previous 100 nucleotides.
+ **Label classification**: Assign a label to a sequence.
    + Example: Assign the label "virus" or "bacteria" to a sequence of length 100.

Suppose we are given 2 fasta files called "a.fasta" and "b.fasta" that look as follows:

<div style="float: left;margin-right:10px">
<table><tr><td>
**a.fasta** <br>
<tt>
>header_a1 <br>
AACCAAGG <br>
>header_a2 <br>
TTTGGG <br>
>header_a3 <br>
ACGTACGT <br>
</tt>
</td></tr></table>
</div>
<div style="float: left">
<table><tr><td>
**b.fasta** <br>
<tt>
>header_b1 <br>
GTGTGT <br>
>header_b2 <br>
AAGG <br>
</tt>
</td></tr></table>
</div>
<br><br><br><br><br><br><br><br><br>

If we want to extract sequences of length 4 from these files, there would be 17 possible samples (5 from <tt>AACCAAGG</tt>, 3 from <tt>TTTGGG</tt>, ...). A naive approach would be to extract the samples in a sequential manner:

*1. sample*:

<div style="float: left;margin-right:10px">
<table><tr><td>
**a.fasta** <br>
<tt>
>header_a1 <br>
<mark class="in">AACC</mark>AAGG <br>
>header_a2 <br>
TTTGGG <br>
>header_a3 <br>
ACGTACGT <br>
</tt>
</td></tr></table>
</div>
<div style="float: left">
<table><tr><td>
**b.fasta** <br>
<tt>
>header_b1 <br>
GTGTGT <br>
>header_b2 <br>
AAGG <br>
</tt>
</td></tr></table>
</div>
<br><br><br><br><br><br><br><br><br>

*2. sample*:

<div style="float: left;margin-right:10px">
<table><tr><td>
**a.fasta** <br>
<tt>
>header_a1 <br>
A<mark class="in">ACCA</mark>AGG <br>
>header_a2 <br>
TTTGGG <br>
>header_a3 <br>
ACGTACGT <br>
</tt>
</td></tr></table>
</div>
<div style="float: left">
<table><tr><td>
**b.fasta** <br>
<tt>
>header_b1 <br>
GTGTGT <br>
>header_b2 <br>
AAGG <br>
</tt>
</td></tr></table>
</div>
<br><br><br><br><br><br><br><br><br>

... <br>

*17. sample*:

<div style="float: left;margin-right:10px">
<table><tr><td>
**a.fasta** <br>
<tt>
>header_a1 <br>
AACCAAGG <br>
>header_a2 <br>
TTTGGG <br>
>header_a3 <br>
ACGTACGT <br>
</tt>
</td></tr></table>
</div>
<div style="float: left">
<table><tr><td>
**b.fasta** <br>
<tt>
>header_b1 <br>
GTGTGT <br>
>header_b2 <br>
<mark class="in">AAGG</mark> <br>
</tt>
</td></tr></table>
</div>
<br><br><br><br><br><br><br><br><br>

*18. sample*:

<div style="float: left;margin-right:10px">
<table><tr><td>
**a.fasta** <br>
<tt>
>header_a1 <br>
<mark class="in">AACC</mark>AAGG <br>
>header_a2 <br>
TTTGGG <br>
>header_a3 <br>
ACGTACGT <br>
</tt>
</td></tr></table>
</div>
<div style="float: left">
<table><tr><td>
**b.fasta** <br>
<tt>
>header_b1 <br>
GTGTGT <br>
>header_b2 <br>
AAGG <br>
</tt>
</td></tr></table>
</div>
<br><br><br><br><br><br><br><br><br>

... <br><br>

For longer sequences this is not a desirable strategy, since the data is very redundant (consecutive samples often differ by just one nucleotide) and the model would often see long stretches of data from the same source. Choosing the samples completely at random can also be problematic, since we would constantly have to open new files. The deepG generator offers several options to navigate the data sampling strategy and achieve a good balance between the two approaches.
## Data generator options

In the following code examples, we will mostly use the sequence <tt>**abcdefghiiii**</tt> to demonstrate some of the deepG data generator options. (In real-world applications you would usually have sequences over the <tt>ACGT</tt> vocabulary.)

```r
sequence <- paste0("a", "b", "c", "d", "e", "f", "g", "h", "i", "i", "i", "i")
vocabulary <- c("a", "b", "c", "d", "e", "f", "g", "h", "i")
```
We may store this sequence in a fasta file:
```r
temp_dir <- tempfile()
dir.create(temp_dir)
dir_path <- paste0(temp_dir, "/dummy_data")
dir.create(dir_path)
df <- data.frame(Sequence = sequence, Header = "label_1", stringsAsFactors = FALSE)
file_path <- file.path(dir_path, "a.fasta")
# write sequence as fasta file
microseq::writeFasta(fdta = dplyr::as_tibble(df), out.file = file_path)
```
Since neural networks can only work with numeric data, we have to encode sequences of characters numerically. Usually this is achieved by one-hot encoding; some other approaches are implemented as well, see the `use_coverage`, `use_quality_score` and `ambiguous_nuc` sections.
```r
# one-hot encoding example
s <- c("a", "c", "a", "f", "i", "b")
s_as_int_seq <- vector("integer", length(s))
for (i in seq_along(s)) {
  s_as_int_seq[i] <- which(s[i] == vocabulary) - 1
}
one_hot_sample <- keras::to_categorical(s_as_int_seq)
colnames(one_hot_sample) <- vocabulary
one_hot_sample
```
### `maxlen`

The length of the input sequence.
### `vocabulary`

The set of allowed characters in a sequence. What happens to characters outside the vocabulary can be controlled with the `ambiguous_nuc` argument.
### `train_type`

The generator will always return a list of length 2. The first element is the input $X$ and the second the target $Y$. The `train_type` argument determines how $X$ and $Y$ get extracted.

Possible options for language models are:

+ `"lm"`: part of a sequence is the input and another part the target. The concrete format can be determined with the `output_format` argument.
+ `"lm_rds"`: as `"lm"`, but the data comes from preprocessed rds files.

Besides the language model approach, we can use label classification. This means we map some label to a sequence. For example, the target for some nucleotide sequence could be one of the labels "bacteria" or "virus". We have to specify how to extract a label corresponding to a sequence. Possible options are:

+ `"label_header"`: extract the label from the fasta header.
+ `"label_folder"`: all samples from the same file or folder get the same label (one `path` entry per class).
+ `"label_rds"`: extract samples and labels from preprocessed rds files.
+ `"label_csv"`: map files to labels with a csv table, which could look as follows:
| file | label_1 | label_2 |
| --- | --- | --- |
| "a.fasta" | 1 | 0 |
Another option is `"dummy_gen"`: the generator creates random data once and returns the identical batch on every call.
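A minimal sketch of the `"dummy_gen"` option, assuming it accepts the same core arguments as the other modes (the file content itself is ignored for dummy data):

```r
# "dummy_gen" sketch: the generator creates one batch of random data and
# returns that identical batch on every call
gen <- get_generator(path = file_path, # content ignored for dummy data
                     train_type = "dummy_gen",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     vocabulary_label = "label_1")
z1 <- gen()
z2 <- gen()
identical(z1, z2) # TRUE: the same random batch every time
```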
Extract target from fasta header (the fasta header is "label_1" in our example file):
```r
# get target from header
vocabulary_label <- paste0("label_", 1:5)
gen <- get_generator(path = file_path,
                     train_type = "label_header",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary_label
x # abcdef
y # label_1
```
Extract target from fasta folder:
```r
# create data for second class
df <- data.frame(Sequence = "AABAACAADAAE", Header = "header_1")
file_path_2 <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, file_path_2)

# get target from folder
vocabulary_label <- paste0("label_", 1:2)
gen <- get_generator(path = c(file_path, file_path_2), # one entry for each class
                     train_type = "label_folder",
                     batch_size = 8,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)
z <- gen()
x <- z[[1]]
y <- z[[2]]
x_1_1 <- x[1, , ]
colnames(x_1_1) <- vocabulary
x_1_1 # first sample from first class
x_2_1 <- x[5, , ]
colnames(x_2_1) <- vocabulary
x_2_1 # first sample from second class
colnames(y) <- vocabulary_label
y # 4 samples from each class
```
Extract target from csv file:
```r
# get target from csv
file <- c(basename(file_path), "xyz.fasta", "abc.fasta", "x_123.fasta")
vocabulary_label <- paste0("label", 1:4)
label_1 <- c(1, 0, 0, 0)
label_2 <- c(0, 1, 0, 0)
label_3 <- c(0, 0, 1, 0)
label_4 <- c(0, 0, 0, 1)
df <- data.frame(file, label_1, label_2, label_3, label_4)
df
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)
gen <- get_generator(path = file_path,
                     train_type = "label_csv",
                     batch_size = 1,
                     maxlen = 6,
                     target_from_csv = csv_file,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary_label
x # abcdef
y # label_1
```
Examples for language models follow in the next section.
### `output_format`

The `output_format` determines the shape of the output for a language model, i.e. part of a sequence is the input $X$ and another part the target $Y$. Assume a sequence <tt>abcdefg</tt> and `maxlen = 6`. The outputs correspond as follows:
"target_right": $X=$ abcdef, $Y=$ g
"target_middle_lstm": $X =$ ($X_1 =$ abc, $X_2 =$ gfe), $Y=$ d (note reversed order of $X_2$)
"target_middle_cnn": $X =$ abcefg, $Y =$ d
"wavenet": $X =$ abcdef, $Y =$ bcdefg
```r
# target_right
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcdef
y # g
```
```r
# target_middle_lstm
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_middle_lstm")
z <- gen()
x_1 <- z[[1]][[1]][1, , ]
x_2 <- z[[1]][[2]][1, , ]
y <- z[[2]]
colnames(x_1) <- vocabulary
colnames(x_2) <- vocabulary
colnames(y) <- vocabulary
x_1 # abc
x_2 # gfe
y # d
```
```r
# target_middle_cnn
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_middle_cnn")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcefg
y # d
```
```r
# wavenet
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "wavenet")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]][1, , ]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcdef
y # bcdefg
```
### `batch_size`

Number of samples in one batch.
```r
# target_right
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 7,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right")
z <- gen()
x <- z[[1]]
y <- z[[2]]
dim(x)
dim(y)
```
### `step`

We may determine how frequently we want to take a sample. If `step = 1`, we take a sample at every possible position. Let's assume we want to predict the next character, i.e. part of the sequence is the input and the next character the target. If `maxlen = 3, step = 1`:

*1. sample*: <mark class="in">abc</mark>defghiiii <br>
*2. sample*: a<mark class="in">bcd</mark>efghiiii <br>
*3. sample*: ab<mark class="in">cde</mark>fghiiii <br>

If `step = 3`:

*1. sample*: <mark class="in">abc</mark>defghiiii <br>
*2. sample*: abc<mark class="in">def</mark>ghiiii <br>
*3. sample*: abcdef<mark class="in">ghi</mark>iii <br>
```r
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 3,
                     vocabulary = vocabulary,
                     step = 3,
                     output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ] # encodes abc
y <- z[[2]] # encodes d
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x
y
# go 3 steps forward
z <- gen()
x <- z[[1]][1, , ] # encodes def
y <- z[[2]] # encodes g
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x
y
```
### `padding`

If the sequence is too short to create a single sample, we can pad the sequence with zero-vectors. If `padding = FALSE`, the generator will go to the next file or fasta entry until it finds a sequence long enough for a sample.
```r
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 15, # maxlen is longer than the sequence
                     vocabulary = vocabulary,
                     step = 3,
                     padding = TRUE,
                     output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # first 4 entries are zero-vectors
y
```
### `ambiguous_nuc`

A sequence might contain characters that do not lie inside our vocabulary. For example, let's assume we discard "e" from our vocabulary. We have 4 options to handle this situation:
(1) encode as zero vector
```r
vocabulary_2 <- c("a", "b", "c", "d", "f", "g", "h", "i") # exclude "e" from vocabulary
# zeros
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "zeros")
z <- gen()
x <- z[[1]][1, , ]
colnames(x) <- vocabulary_2
x # fifth row is a zero-vector
```
(2) equal probability
```r
# equal
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "equal")
z <- gen()
x <- z[[1]][1, , ]
colnames(x) <- vocabulary_2
x # fifth row is 1/8 for every entry
```
(3) use distribution of current file
```r
# empirical
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "empirical")
z <- gen()
x <- z[[1]][1, , ]
colnames(x) <- vocabulary_2
x # fifth row is the character distribution of the file
```
(4) discard
```r
# discard
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "discard")
z <- gen()
x <- z[[1]][1, , ]
colnames(x) <- vocabulary_2
x # first sample with only characters from the vocabulary is fghiii|i
```
### `proportion_per_seq`

The `proportion_per_seq` argument gives the option to use a random subsequence instead of the full sequence.
cat("sequence is ", nchar(sequence), "characters long \n") gen <- get_generator(path = file_path, train_type = "lm", batch_size = 1, maxlen = 5, seed = 1, vocabulary = vocabulary, output_format = "target_right", # take random subsequence using 50% of sequence proportion_per_seq = 0.5) z <- gen() x <- z[[1]][1, , ] y <- z[[2]] colnames(x) <- vocabulary colnames(y) <- vocabulary x # defgh y # i
### `max_files`

Integer or `NULL`. If an integer, use only the specified number of randomly sampled files for training.
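A minimal sketch (the directory and file contents below are made up for illustration): with two files and `max_files = 1`, the generator only ever samples from one randomly chosen file.

```r
# directory with two fasta files
dir_mf <- tempfile()
dir.create(dir_mf)
df_1 <- data.frame(Sequence = "ACGTACGT", Header = "header_1")
df_2 <- data.frame(Sequence = "GGGGCCCC", Header = "header_2")
fasta_file <- microseq::writeFasta(df_1, file.path(dir_mf, "f1.fasta"))
fasta_file <- microseq::writeFasta(df_2, file.path(dir_mf, "f2.fasta"))
gen <- get_generator(path = dir_mf,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = c("A", "C", "G", "T"),
                     output_format = "target_right",
                     max_files = 1) # sample from only 1 of the 2 files
```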
### `delete_used_files`

If `TRUE`, delete a file once it has been used. Only applies to rds files.
```r
x <- array(0, dim = c(1, 5, 4))
y <- matrix(0, ncol = 1)
rds_path <- tempfile(fileext = ".rds")
saveRDS(list(x, y), rds_path)
gen <- get_generator(path = rds_path,
                     delete_used_files = TRUE,
                     train_type = "label_rds",
                     batch_size = 1,
                     maxlen = 5)
z <- gen()
file.exists(rds_path)
# z <- gen()
# When calling the generator again, it will wait until it finds a file again from
# the files listed in the initial `path` argument. Can be used if other processes
# create rds files.
```
### `max_samples`

Only use a fixed number of samples per file. The generator randomly chooses which samples to use. (If `random_sampling = FALSE`, the chosen samples are consecutive.)
```r
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 2,
                     maxlen = 5,
                     step = 1,
                     seed = 3,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     max_samples = 2)
z <- gen()
x1 <- z[[1]][1, , ]
x2 <- z[[1]][2, , ]
colnames(x1) <- vocabulary
colnames(x2) <- vocabulary
x1 # bcdef
x2 # cdefg
```
### `random_sampling`

If you use `max_samples`, the generator will randomly choose a subset from all possible samples, but those samples are consecutive. With `random_sampling = TRUE`, the samples are completely random.
```r
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 2,
                     maxlen = 5,
                     seed = 66,
                     random_sampling = TRUE,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     max_samples = 2)
z <- gen()
x1 <- z[[1]][1, , ]
x2 <- z[[1]][2, , ]
colnames(x1) <- vocabulary
colnames(x2) <- vocabulary
x1 # efghi
x2 # defgh
```
### `target_len`

Target length for the language model.
```r
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     target_len = 3,
                     maxlen = 5,
                     vocabulary = vocabulary,
                     output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
y1 <- z[[2]][ , 1, ]
y2 <- z[[2]][ , 2, ]
y3 <- z[[2]][ , 3, ]
colnames(x) <- vocabulary
names(y1) <- vocabulary
names(y2) <- vocabulary
names(y3) <- vocabulary
x # abcde
y1 # f
y2 # g
y3 # h
```
### `n_gram`

Do not encode the language model target character-wise but combine n characters into one target. `n_gram_stride` determines the step size of the n-gram encoding.
```r
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     target_len = 6,
                     n_gram = 3,
                     n_gram_stride = 3,
                     maxlen = 3,
                     vocabulary = vocabulary,
                     output_format = "target_right")
z <- gen()
x <- z[[1]]
y1 <- z[[2]][ , 1, ]
y2 <- z[[2]][ , 2, ]
dim(x)[3] == length(vocabulary)^3
# x = abc as 3-gram
# y1 = def as 3-gram
# y2 = ghi as 3-gram
```
### `add_noise`

Add noise to the input. Must be a list that specifies the noise distribution, or `NULL` (no noise). The list contains the argument `noise_type`: either `"normal"` or `"uniform"`. Optional arguments are `sd` and `mean` if `noise_type` is `"normal"` (default is `sd = 1` and `mean = 0`), or `min` and `max` if `noise_type` is `"uniform"` (default is `min = 0`, `max = 1`).
```r
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     add_noise = list(noise_type = "normal", mean = 0, sd = 0.01),
                     maxlen = 5,
                     vocabulary = vocabulary,
                     output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
round(x, 3) # abcde + noise
y # f
```
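Analogously, a short sketch with uniform noise, using the `min`/`max` arguments described above:

```r
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     # add U(0, 0.1) noise to every input entry
                     add_noise = list(noise_type = "uniform", min = 0, max = 0.1),
                     maxlen = 5,
                     vocabulary = vocabulary,
                     output_format = "target_right")
z <- gen()
round(z[[1]][1, , ], 3) # abcde + noise
```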
### `proportion_entries`

If a fasta file has multiple entries, you can randomly choose a subset. For example, if the file has 6 entries and `proportion_entries = 0.5`, the generator will randomly choose only 3 of the entries.
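A minimal sketch (the file contents are made up for illustration):

```r
# fasta file with 4 entries; with proportion_entries = 0.5 the generator
# randomly uses only 2 of them
df <- data.frame(Sequence = c("ACGTACGT", "AAAACCCC", "GGGGTTTT", "ACACACAC"),
                 Header = paste0("header_", 1:4))
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = c("A", "C", "G", "T"),
                     output_format = "target_right",
                     proportion_entries = 0.5)
```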
### `shuffle_file_order`

Shuffle the file order before iterating through the files. The order gets reshuffled after every iteration.
### `shuffle_input`

Whether to shuffle the entries of a fasta file, if the file has multiple entries.
### `reverse_complement`

If `TRUE`, randomly decide for every batch whether to use the original sequence or its reverse complement. Only implemented for the ACGT vocabulary.
### `sample_by_file_size`

Randomly choose a new file by sampling according to file size (bigger files are more likely).
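A combined sketch of the last few flags (file contents made up for illustration; ACGT alphabet, since `reverse_complement` is only implemented for it; `sample_by_file_size = TRUE` could be used as an alternative to `shuffle_file_order`):

```r
# two small fasta files in one directory
dir_flags <- tempfile()
dir.create(dir_flags)
df_1 <- data.frame(Sequence = c("ACGTACGTAC", "TTGGCCAATT"), Header = paste0("header_", 1:2))
df_2 <- data.frame(Sequence = "AACCGGTTAA", Header = "header_3")
fasta_file <- microseq::writeFasta(df_1, file.path(dir_flags, "f1.fasta"))
fasta_file <- microseq::writeFasta(df_2, file.path(dir_flags, "f2.fasta"))
gen <- get_generator(path = dir_flags,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = c("A", "C", "G", "T"),
                     output_format = "target_right",
                     shuffle_file_order = TRUE, # random file order per iteration
                     shuffle_input = TRUE, # shuffle the two entries of f1.fasta
                     reverse_complement = TRUE) # per batch, maybe use reverse complement
```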
### `concat_seq`

Character string or `NULL`. If not `NULL`, all entries from a file get concatenated to one sequence, with the `concat_seq` string between them. Use `concat_seq = ""` if you don't want to add a new token.
df <- data.frame(Sequence = c("AC", "AG", "AT"), Header = paste0("header", 1:3)) fasta_path <- tempfile(fileext = ".fasta") fasta_file <- microseq::writeFasta(df, fasta_path) gen <- get_generator(path = fasta_path, train_type = "lm", batch_size = 1, maxlen = 9, vocabulary = c("A", "C", "G", "T", "Z"), concat_seq = "ZZ", output_format = "target_right") z <- gen() x <- z[[1]][1, , ] y <- z[[2]] colnames(x) <- c("A", "C", "G", "T", "Z") colnames(y) <- c("A", "C", "G", "T", "Z") x # ACZZAGZZA y # T
### `set_learning`

Use this when you want to assign one label to a set of samples. Only implemented for `train_type = "label_folder"`. The input is a list with the following parameters:

+ `samples_per_target`: how many samples to use for one target.
+ `maxlen`: length of one sample.
+ `reshape_mode`: `"time_dist"`, `"multi_input"` or `"concat"`.
    + If `reshape_mode = "multi_input"`, the generator will produce `samples_per_target` separate inputs, each of length `maxlen`.
    + If `reshape_mode = "time_dist"`, the generator will produce a 4D input array. The dimensions correspond to `(batch_size, samples_per_target, maxlen, length(vocabulary))`.
    + If `reshape_mode` is `"concat"`, the generator will concatenate `samples_per_target` sequences of length `maxlen` to one long sequence.
+ If `reshape_mode = "concat"`, there is an additional `buffer_len` argument to add a new token between the concatenated samples: if `buffer_len` is an integer, the sub-sequences are interspaced with `buffer_len` rows. The input length is then (`maxlen` * `samples_per_target`) + `buffer_len` * (`samples_per_target` - 1).

```r
# create data for second label
df <- data.frame(Sequence = "AABAACAADAAE", Header = "header_1")
file_path_2 <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, file_path_2)

# multi_input
set_learning <- list(reshape_mode = "multi_input",
                     maxlen = 4,
                     samples_per_target = 3)
gen <- get_generator(path = c(file_path, file_path_2), # path has length 2 => 2 classes
                     train_type = "label_folder",
                     batch_size = 2,
                     maxlen = 4,
                     step = 1,
                     vocabulary = vocabulary,
                     set_learning = set_learning)
z <- gen()
x <- z[[1]]
y <- z[[2]]
length(x) # 3 samples per target
x_1_1 <- x[[1]][1, , ]
x_1_1 # abcd
x_1_2 <- x[[2]][1, , ]
x_1_2 # bcde
x_1_3 <- x[[3]][1, , ]
x_1_3 # cdef
x_2_1 <- x[[1]][2, , ]
x_2_1 # aaba
x_2_2 <- x[[2]][2, , ]
x_2_2 # abaa
x_2_3 <- x[[3]][2, , ]
x_2_3 # baac
colnames(y) <- c("label_1", "label_2")
y
```
```r
# concat
set_learning <- list(reshape_mode = "concat",
                     maxlen = 4,
                     samples_per_target = 3)
gen <- get_generator(path = c(file_path, file_path_2), # path has length 2 => 2 classes
                     train_type = "label_folder",
                     batch_size = 2,
                     maxlen = 4,
                     step = 2,
                     vocabulary = vocabulary,
                     set_learning = set_learning)
z <- gen()
x <- z[[1]]
y <- z[[2]]
dim(x)
x_1 <- x[1, , ]
colnames(x_1) <- vocabulary
x_1 # abcd | cdef | efgh
x_2 <- x[2, , ]
colnames(x_2) <- vocabulary
x_2 # aaba | baac | acaa
colnames(y) <- c("label_1", "label_2")
y
```
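There is no example above for `reshape_mode = "time_dist"`; here is a sketch, relying on the dimension layout described in the list above:

```r
# time_dist: the samples for one target get stacked into a 4D array
set_learning <- list(reshape_mode = "time_dist",
                     maxlen = 4,
                     samples_per_target = 3)
gen <- get_generator(path = c(file_path, file_path_2), # 2 classes
                     train_type = "label_folder",
                     batch_size = 2,
                     maxlen = 4,
                     step = 1,
                     vocabulary = vocabulary,
                     set_learning = set_learning)
z <- gen()
dim(z[[1]]) # (batch_size, samples_per_target, maxlen, length(vocabulary)) = (2, 3, 4, 9)
```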
### `use_quality_score`

If `TRUE`, use the quality scores of a fastq file for the encoding instead of plain one-hot encoding.
df <- data.frame(Sequence = "ACAGAT", Header = "header_1", Quality = "!#*=?I") fastq_path <- tempfile(fileext = ".fastq") fastq_file <- microseq::writeFastq(df, fastq_path) gen <- get_generator(path = fastq_path, train_type = "lm", batch_size = 1, maxlen = 5, format = "fastq", vocabulary = c("A", "C", "G", "T"), use_quality_score = TRUE, output_format = "target_right") z <- gen() x <- z[[1]][1, , ] y <- z[[2]] colnames(x) <- c("A", "C", "G", "T") colnames(y) <- c("A", "C", "G", "T") x # ACAGA y # T
### `use_coverage`

Integer or `NULL`. If not `NULL`, use coverage as encoding rather than one-hot encoding. The coverage information must be contained in the fasta header: there must be a string "cov_n" in the header, where n is some integer.
df <- data.frame(Sequence = "ACAGAT", Header = "header_1_cov_8") fasta_path <- tempfile(fileext = ".fasta") fasta_file <- microseq::writeFasta(df, fasta_path) gen <- get_generator(path = fasta_path, train_type = "lm", batch_size = 1, maxlen = 5, vocabulary = c("A", "C", "G", "T"), use_coverage = 25, output_format = "target_right") z <- gen() x <- z[[1]][1, , ] y <- z[[2]] colnames(x) <- c("A", "C", "G", "T") colnames(y) <- c("A", "C", "G", "T") x # ACAGA; 0.32 = 8/25 y # T
### `added_label_path`

It is possible to feed a network additional information associated with a sequence. This information needs to be in a csv file. If all sequences in one file share the same label, the csv file should have one column named "file".
We may add some additional input to our dummy data:
```r
file <- c(basename(file_path), "some_file_name.fasta")
df <- data.frame(file = file,
                 label_1 = c(0, 1),
                 label_2 = c(1, 0),
                 label_3 = c(1, 0))
df
write.csv(x = df, file = file.path(dir_path, "add_input.csv"), row.names = FALSE)
```
If we add the path to the csv file, the generator will map additional input to sequences:
```r
gen <- get_generator(path = dir_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     output_format = "target_right",
                     vocabulary = vocabulary,
                     added_label_path = file.path(dir_path, "add_input.csv"),
                     add_input_as_seq = FALSE) # don't treat added input as sequence
z <- gen()
added_label_input <- z[[1]][[1]]
added_label_input
x <- z[[1]][[2]]
x[1, , ]
y <- z[[2]]
y
```
If we want to train a network with additional labels, we have to add an additional input layer.
```r
model <- create_model_lstm_cnn(
  maxlen = 5,
  layer_lstm = c(8, 8),
  layer_dense = c(4),
  label_input = 3 # additional input vector has length 3
)
# train_model(train_type = "lm",
#             model = model,
#             path = file.path(dir_path, "train_files_1"),
#             path_val = file.path(dir_path, "validation_files_1"),
#             added_label_path = file.path(dir_path, "add_input.csv"),
#             steps_per_epoch = 5,
#             batch_size = 8,
#             epochs = 2)
```
### `return_int`

Whether to return integer encoding rather than one-hot encoding.
df <- data.frame(Sequence = "ATCGC", Header = "seq_1") fasta_path <- tempfile(fileext = ".fasta") fasta_file <- microseq::writeFasta(df, fasta_path) gen <- get_generator(path = fasta_path, train_type = "lm", batch_size = 1, return_int = TRUE, padding = TRUE, maxlen = 8, vocabulary = c("A", "C", "G", "T"), output_format = "target_right") z <- gen() x <- z[[1]] y <- z[[2]] colnames(x) <- c("pad", "pad", "pad", "pad", "A", "T", "C", "G") x colnames(y) <- "C" y
Can also be combined with n-gram encoding:
df <- data.frame(Sequence = "AAACCCTTT", Header = "seq_1") fasta_path <- tempfile(fileext = ".fasta") fasta_file <- microseq::writeFasta(df, fasta_path) gen <- get_generator(path = fasta_path, train_type = "lm", batch_size = 1, n_gram = 3, n_gram_stride = 3, return_int = TRUE, maxlen = 6, target_len = 3, vocabulary = c("A", "C", "G", "T"), output_format = "target_right") z <- gen() x <- z[[1]] y <- z[[2]] colnames(x) <- c("AAA", "CCC") x colnames(y) <- "TTT" y
### `reshape_xy`

Apply some function to the output of a generator call.
df <- data.frame(Sequence = "AAAATTTT", Header = "header_1") fasta_path <- tempfile(fileext = ".fasta") fasta_file <- microseq::writeFasta(df, fasta_path) fx <- function(x = NULL, y = NULL) { return(x - 1) } fy <- function(x = NULL, y = NULL) { return(exp(y * 5)) } gen <- get_generator(path = fasta_path, reshape_xy = list(x = fx, y = fy), train_type = "label_folder", batch_size = 1, maxlen = 8) z <- gen() x <- z[[1]] x[1,,] y <- z[[2]] y
### `masked_lm`

Masks some parts of the input sequence. Can be used for training BERT-like models.
```r
nt_seq <- rep(c("A", "C", "G", "T"), each = 25) %>% paste(collapse = "")
df <- data.frame(Sequence = nt_seq, Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
masked_lm <- list(mask_rate = 0.10, # replace 10% of input with special mask token
                  random_rate = 0.025, # set 2.5% of input to random value
                  identity_rate = 0.05, # leave 5% unchanged
                  include_sw = TRUE) # 0/1 matrix showing where masking was applied
gen <- get_generator(path = fasta_path,
                     train_type = "masked_lm",
                     masked_lm = masked_lm,
                     batch_size = 1,
                     n_gram = 1,
                     n_gram_stride = 1,
                     return_int = TRUE,
                     maxlen = 100,
                     vocabulary = c("A", "C", "G", "T"))
z <- gen()
x <- z[[1]]
y <- z[[2]]
sw <- z[[3]]
df <- data.frame(x = x[1, ], y = y[1, ], sw = sw[1, ])
head(df)
```
Whenever the sw (sample weight) column is 0, the x and y columns are identical. Let's look at the rows where sw is 1:
```r
df %>% dplyr::filter(sw == 1)
```
Here 5 is the mask token; it is always the size of the vocabulary + 1.
```r
df %>% dplyr::filter(sw == 1 & x == 5) # 10% masked part
df %>% dplyr::filter(sw == 1 & x != 5) # 5% identity part and 2.5% random part (can randomly be the true value)
```
Can be combined with n-gram encoding and masking of fixed block size:
```r
nt_seq <- rep(c("A", "C", "G", "T"), each = 25) %>% paste(collapse = "")
df <- data.frame(Sequence = nt_seq, Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
masked_lm <- list(mask_rate = 0.10, # replace 10% of input with special mask token
                  random_rate = 0.05, # set 5% of input to random value
                  identity_rate = 0.05, # leave 5% unchanged
                  include_sw = TRUE, # 0/1 matrix showing where masking was applied
                  block_len = 3) # always mask at least 3 tokens in a row
gen <- get_generator(path = fasta_path,
                     train_type = "masked_lm",
                     masked_lm = masked_lm,
                     batch_size = 1,
                     n_gram = 3,
                     seed = 12,
                     n_gram_stride = 1,
                     return_int = TRUE,
                     maxlen = 100,
                     vocabulary = c("A", "C", "G", "T"))
z <- gen()
x <- z[[1]]
y <- z[[2]]
sw <- z[[3]]
df <- data.frame(x = x[1, ], y = y[1, ], sw = sw[1, ], position = 1:ncol(x))
head(df)
tail(df)
```
We can check that sample weights appear only in blocks.
```r
which(sw == 1)
```
Here 65 is the mask token (the 3-gram vocabulary has $4^3 = 64$ entries, and the mask token is always the vocabulary size + 1).
```r
df %>% dplyr::filter(sw == 1 & x == 65) # 10% masked part
df %>% dplyr::filter(sw == 1 & x != 65) # 5% identity part and 5% random part (can randomly be the true value)
```