In GenomeNet/deepG: Deep Learning for Genome Sequence Data

predict_model

R Documentation

Make prediction for nucleotide sequence or entries in fasta/fastq file

Description

Optionally removes layers from a pretrained model and calculates the states for a nucleotide sequence or a fasta/fastq file. Writes the states to an h5 or csv file (access the content of h5 output with the load_prediction function). The output_format argument offers several options for processing an input file:

If "one_seq", computes predictions for the sequence argument or the fasta/fastq file. Combines all fasta entries in the file into one sequence, so a predictor sequence can contain elements from more than one fasta entry.
If "by_entry", writes a separate output file for each fasta/fastq entry. Names of output files are: output_dir + "Nr" + i + filename + output_type, where i is the number of the fasta entry.
If "by_entry_one_file", stores the predictions for all fasta entries in one h5 file.
If "one_pred_per_entry", makes one prediction per entry, either by picking a random sample from a long sequence or by padding a short sequence.
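
To make the "one_seq" behaviour more concrete, the following base R sketch shows how a window can span two fasta entries once they are combined. It is an illustration under assumptions, not deepG's implementation: the helper name sliding_windows is hypothetical, and it assumes that windows of length maxlen (the model's input length) start every step positions.

```r
# Hypothetical helper, for illustration only (not part of deepG):
# cut a string into windows of length maxlen, starting every `step` positions.
sliding_windows <- function(x, maxlen, step = 1) {
  n <- nchar(x)
  if (n < maxlen) return(character(0))  # too short for a single window
  starts <- seq(1, n - maxlen + 1, by = step)
  substring(x, starts, starts + maxlen - 1)
}

# Two toy fasta entries, combined into one sequence as "one_seq" describes.
entries <- c(entry_1 = "aaaa", entry_2 = "cccc")
combined <- paste(entries, collapse = "")

windows <- sliding_windows(combined, maxlen = 4, step = 2)
print(windows)
# The middle window "aacc" contains characters from both entries.
```

If windows that mix entries are not wanted, the "by_entry" and "by_entry_one_file" options process each entry on its own.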

Usage

predict_model(
  model,
  output_format = "one_seq",
  layer_name = NULL,
  sequence = NULL,
  path_input = NULL,
  round_digits = NULL,
  filename = "states.h5",
  step = 1,
  vocabulary = c("a", "c", "g", "t"),
  batch_size = 256,
  verbose = TRUE,
  return_states = FALSE,
  output_type = "h5",
  padding = "none",
  use_quality = FALSE,
  quality_string = NULL,
  mode = "label",
  lm_format = "target_right",
  output_dir = NULL,
  format = "fasta",
  include_seq = FALSE,
  reverse_complement_encoding = FALSE,
  ambiguous_nuc = "zero",
  ...
)

Arguments

`model`	A keras model.
`output_format`	Either `"one_seq"`, `"by_entry"`, `"by_entry_one_file"`, `"one_pred_per_entry"`.
`layer_name`	Name of layer to get output from. If `NULL`, will use the last layer.
`sequence`	Character string. If given, `path_input` is ignored.
`path_input`	Path to fasta or fastq file.
`round_digits`	Number of decimal places.
`filename`	Filename to store states in. No file output if argument is `NULL`. If `output_format = "by_entry"`, adds "nr" + "i" after name, where i is entry number.
`step`	Frequency of sampling steps.
`vocabulary`	Vector of allowed characters. Characters outside vocabulary get encoded as specified in `ambiguous_nuc`.
`batch_size`	Number of samples processed in one batch.
`verbose`	Boolean.
`return_states`	Return predictions as data frame. Only supported for output_format `"one_seq"`.
`output_type`	`"h5"` or `"csv"`. If `output_format` is `"by_entry_one_file"` or `"one_pred_per_entry"`, can only be `"h5"`.
`padding`	Either `"none"`, `"maxlen"`, `"standard"` or `"self"`. If `"none"`, apply no padding and skip sequences that are too short. If `"maxlen"`, pad with maxlen zero vectors. If `"standard"`, pad with zero vectors only if the sequence is shorter than maxlen, up to the minimum size required for one prediction. If `"self"`, concatenate the sequence with itself until it is long enough for one prediction; only applied if the sequence is shorter than maxlen. Example: if the sequence is "ACGT" and maxlen is 10, make a prediction for "ACGTACGTAC".
`use_quality`	Whether to use quality scores.
`quality_string`	String for encoding with quality scores (as used in fastq format).
`mode`	Either `"lm"` for language model or `"label"` for label classification.
`lm_format`	Either `"target_right"`, `"target_middle_lstm"`, `"target_middle_cnn"` or `"wavenet"`.
`output_dir`	Directory for file output.
`format`	File format: `"fasta"`, `"fastq"` or `"rds"`; for `tar.gz` files, `"fasta.tar.gz"` or `"fastq.tar.gz"`.
`include_seq`	Whether to include input sequence in h5 file.
`reverse_complement_encoding`	Whether to use both original sequence and reverse complement as two input sequences.
`ambiguous_nuc`	How to handle nucleotides outside vocabulary, either `"zero"`, `"discard"`, `"empirical"` or `"equal"`. If `"zero"`, input gets encoded as zero vector. If `"equal"`, input is repetition of `1/length(vocabulary)`. If `"discard"`, samples containing nucleotides outside vocabulary get discarded. If `"empirical"`, use nucleotide distribution of current file.
`...`	Further arguments for sequence encoding with `seq_encoding_label`.
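
The `"self"` padding rule and the `"zero"` and `"equal"` options of `ambiguous_nuc` can be illustrated in base R. This is a sketch under stated assumptions, not deepG's code: `pad_self` and `encode_one_hot` are hypothetical helper names, the encoding is assumed to be one row per position with one column per vocabulary character, and input is assumed to be matched case-insensitively.

```r
# Hypothetical helpers, for illustration only (not part of deepG).

# "self" padding: repeat the sequence until it reaches maxlen, then truncate.
pad_self <- function(x, maxlen) {
  n <- nchar(x)
  if (n == 0 || n >= maxlen) return(x)  # only applied to short, non-empty input
  reps <- ceiling(maxlen / n)
  substr(strrep(x, reps), 1, maxlen)
}

# One-hot encoding with two of the ambiguous_nuc strategies.
encode_one_hot <- function(x, vocabulary = c("a", "c", "g", "t"),
                           ambiguous_nuc = c("zero", "equal")) {
  ambiguous_nuc <- match.arg(ambiguous_nuc)
  chars <- strsplit(tolower(x), "")[[1]]
  idx <- match(chars, vocabulary)  # NA for characters outside the vocabulary
  m <- matrix(0, nrow = length(chars), ncol = length(vocabulary),
              dimnames = list(NULL, vocabulary))
  known <- which(!is.na(idx))
  m[cbind(known, idx[known])] <- 1  # set the matching column for known characters
  if (ambiguous_nuc == "equal") {
    m[is.na(idx), ] <- 1 / length(vocabulary)  # spread weight evenly
  }
  m
}

padded <- pad_self("ACGT", maxlen = 10)
print(padded)

enc_zero <- encode_one_hot("acn", ambiguous_nuc = "zero")
enc_equal <- encode_one_hot("acn", ambiguous_nuc = "equal")
print(enc_equal)
```

The call to `pad_self` reproduces the "ACGT" example from the `padding` argument. In the encoded matrices, the row for "n" is all zeros under `"zero"` and all `1/length(vocabulary)` under `"equal"`.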

Value

If return_states = TRUE, returns a list of model predictions and the positions of the corresponding sequences. If additionally include_seq = TRUE, the list also contains the sequence strings. If return_states = FALSE, returns nothing and only writes output to file(s).

Examples


# make prediction for single sequence and write to h5 file
model <- create_model_lstm_cnn(maxlen = 20, layer_lstm = 8, layer_dense = 2, verbose = FALSE)
vocabulary <- c("a", "c", "g", "t")
sequence <- paste(sample(vocabulary, 200, replace = TRUE), collapse = "")
output_file <- tempfile(fileext = ".h5")
predict_model(output_format = "one_seq", model = model, step = 10,
             sequence = sequence, filename = output_file, mode = "label")

# make prediction for fasta file with multiple entries, write output to separate h5 files
fasta_path <- tempfile(fileext = ".fasta")
create_dummy_data(file_path = fasta_path, num_files = 1,
                 num_seq = 5, seq_length = 100,
                 write_to_file_path = TRUE)
model <- create_model_lstm_cnn(maxlen = 20, layer_lstm = 8, layer_dense = 2, verbose = FALSE)
output_dir <- tempfile()
dir.create(output_dir)
predict_model(output_format = "by_entry", model = model, step = 10, verbose = FALSE,
               output_dir = output_dir, mode = "label", path_input = fasta_path)
list.files(output_dir)

GenomeNet/deepG documentation built on Jan. 25, 2025, 12:05 a.m.