
deepG

deepG is a package for generating LSTM models from genomic text. It provides scripts for common tasks such as the extraction of cell responses, and it comes with example datasets of genomic and human-readable languages for testing.

Please see our Wiki for further installation instructions. It also covers usage instructions for multi-GPU machines.

See the help file ?deepG to get started. For questions, consult the FAQ.

The library comes with multiple datasets for testing:

The set data(parenthesis) contains 100k characters of a synthetic parenthesis language, generated from a very simple counting language over the alphabet of parentheses and digits Σ = {( ) 0 1 2 3 4}. Parentheses must match, and nesting is limited to at most 4 levels. Each opening parenthesis increases the nesting level and each closing parenthesis decreases it. Digits are placed randomly, but each digit must indicate the nesting level at its position.
The set data(crispr_full) contains all CRISPR loci found in NCBI representative genomes, together with the neighboring nucleotides upstream and downstream.
The set data(crispr_sample) contains a subset of data(crispr_full).
The set data(ecoli) contains the E. coli genome (see the genome sequence of Escherichia coli K-12).
The set data(ecoli_small) contains a subset of data(ecoli).
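
The rules of the parenthesis language described above can be illustrated with a small generator. This is only a sketch in base R, not the code used to build data(parenthesis): the function name generate_parenthesis, its arguments, and the uniform random choice between moves are assumptions made for illustration.

```r
# Hypothetical generator for the parenthesis language (illustration only,
# not deepG's own code). Digits report the current nesting level.
generate_parenthesis <- function(n, max_depth = 4, seed = 1) {
  set.seed(seed)
  out <- character(n)
  depth <- 0
  for (i in seq_len(n)) {
    remaining <- n - i  # positions left after the current one
    moves <- character(0)
    # A digit is allowed if enough positions remain to close open parentheses
    if (depth <= remaining) moves <- c(moves, "digit")
    # Opening is allowed below the depth limit and if it can still be closed
    if (depth < max_depth && depth + 1 <= remaining) moves <- c(moves, "open")
    # Closing is allowed whenever a parenthesis is open
    if (depth > 0) moves <- c(moves, "close")
    move <- sample(moves, 1)
    if (move == "open") {
      depth <- depth + 1
      out[i] <- "("
    } else if (move == "close") {
      depth <- depth - 1
      out[i] <- ")"
    } else {
      out[i] <- as.character(depth)
    }
  }
  paste(out, collapse = "")
}

# Print a short sample string
cat(generate_parenthesis(60), "\n")
```

The bookkeeping on the remaining positions guarantees that every opened parenthesis is closed by the end of the string; whether the shipped dataset enforces this at its final character is not stated here.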

library(deepG)
data("ecoli") # loads the nucleotide sequence of E. coli
preprocessed <- preprocessSemiRedundant(substr(ecoli, 2, 5000), maxlen = 250) # prepares the batches (one-hot encoding)
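
To give an intuition for what "semi-redundant" batches with one-hot encoding might look like, here is a minimal base R sketch. It is not the implementation of preprocessSemiRedundant: the function name one_hot_windows, the stride argument, and the layout of the returned list are assumptions for illustration, and the actual deepG output may differ. The idea is to cut the sequence into overlapping windows of maxlen characters, use the character following each window as the prediction target, and encode both as one-hot vectors.

```r
# Illustrative only: name, arguments and output layout are assumptions,
# not the deepG API.
one_hot_windows <- function(sequence, maxlen, stride = 1,
                            vocabulary = c("a", "g", "c", "t")) {
  chars <- strsplit(tolower(sequence), "")[[1]]
  stopifnot(length(chars) > maxlen)
  idx <- match(chars, vocabulary)  # NA for characters outside the vocabulary
  starts <- seq(1, length(chars) - maxlen, by = stride)
  # X: windows x positions x vocabulary, Y: windows x vocabulary
  x <- array(0, dim = c(length(starts), maxlen, length(vocabulary)))
  y <- matrix(0, nrow = length(starts), ncol = length(vocabulary))
  for (i in seq_along(starts)) {
    window <- idx[starts[i]:(starts[i] + maxlen - 1)]
    target <- idx[starts[i] + maxlen]
    ok <- which(!is.na(window))
    # Unknown characters stay all-zero
    if (length(ok) > 0) x[cbind(i, ok, window[ok])] <- 1
    if (!is.na(target)) y[i, target] <- 1
  }
  list(X = x, Y = y)
}

# Toy usage: windows of 4 nucleotides, shifted by 2 positions each
batches <- one_hot_windows("acgtacgtac", maxlen = 4, stride = 2)
print(dim(batches$X))
```

A stride smaller than maxlen makes consecutive windows overlap, which is where the redundancy comes from.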

The following call trains the network and generates the binary file example_full_model.hdf5. For more options, see the Wiki page Training of GenomeNet.

trainNetwork(
  dataset = preprocessed,
  batch.size = 500,
  epochs = 5,
  maxlen = 250,
  layers.lstm = 2,
  layer.size = 25,
  use.cudnn = FALSE,
  run.name = "example",
  tensorboard.log = "log",
  path.val = "",
  output = list(
    none = FALSE,
    checkpoints = FALSE,
    tensorboard = FALSE,
    log = FALSE,
    serialize_model = FALSE,
    full_model = TRUE
  )
)

We can now use the trained model to generate neuron responses (states) for a subset of the E. coli genome. This will generate a binary file named states.h5.

writeStates(
  model.path = "example_full_model.hdf5",
  sequence = substr(ecoli, 2, 5000),
  batch.size = 256,
  layer.depth = 1,
  filename = "states",
  vocabulary = c("a", "g", "c", "t"),
  step = 1,
  padding = TRUE
)