evaluate_model: Evaluates a trained model on fasta, fastq or rds files
In GenomeNet/deepG: Deep Learning for Genome Sequence Data

Description

Returns evaluation metrics such as confusion matrix, loss, AUC, AUPRC, MAE and MSE (depending on the output layer).

Usage

evaluate_model(
  path_input,
  model = NULL,
  batch_size = 100,
  step = 1,
  padding = FALSE,
  vocabulary = c("a", "c", "g", "t"),
  vocabulary_label = list(c("a", "c", "g", "t")),
  number_batches = 10,
  format = "fasta",
  target_middle = FALSE,
  mode = "lm",
  output_format = "target_right",
  ambiguous_nuc = "zero",
  evaluate_all_files = FALSE,
  verbose = TRUE,
  max_iter = 20000,
  target_from_csv = NULL,
  max_samples = NULL,
  proportion_per_seq = NULL,
  concat_seq = NULL,
  seed = 1234,
  auc = FALSE,
  auprc = FALSE,
  path_pred_list = NULL,
  exact_num_samples = NULL,
  activations = NULL,
  shuffle_file_order = FALSE,
  include_seq = FALSE,
  ...
)

Arguments

`path_input`	Input directory where fasta, fastq or rds files are located.
`model`	A keras model.
`batch_size`	Number of samples per batch.
`step`	Step size between consecutive samples, i.e. a new sample is taken every `step` positions of the sequence.
`padding`	Whether to pad sequences too short for one sample with zeros.
`vocabulary`	Vector of allowed characters. Characters outside the vocabulary are encoded as specified in `ambiguous_nuc`.
`vocabulary_label`	List of labels for targets of each output layer.
`number_batches`	How many batches to evaluate.
`format`	File format, `"fasta"`, `"fastq"` or `"rds"`.
`target_middle`	Whether the model is a language model with separate input layers.
`mode`	Either `"lm"` for language model or `"label_header"`, `"label_csv"` or `"label_folder"` for label classification.
`output_format`	Determines the shape of the output tensor for a language model. Either `"target_right"`, `"target_middle_lstm"`, `"target_middle_cnn"` or `"wavenet"`. For the sequence `"AACCGTA"`, the outputs correspond as follows: `"target_right": X = "AACCGT", Y = "A"`; `"target_middle_lstm": X = (X_1 = "AAC", X_2 = "ATG"), Y = "C"` (note the reversed order of X_2); `"target_middle_cnn": X = "AACGTA", Y = "C"`; `"wavenet": X = "AACCGT", Y = "ACCGTA"`.
`ambiguous_nuc`	How to handle nucleotides outside vocabulary, either `"zero"`, `"discard"`, `"empirical"` or `"equal"`. If `"zero"`, input gets encoded as zero vector. If `"equal"`, input is repetition of `1/length(vocabulary)`. If `"discard"`, samples containing nucleotides outside vocabulary get discarded. If `"empirical"`, use nucleotide distribution of current file.
`evaluate_all_files`	Boolean. If `TRUE`, iterates over all files in `path_input` once; in that case `number_batches` is overwritten.
`verbose`	Boolean.
`max_iter`	Stop if `max_iter` iterations fail to produce a new batch.
`target_from_csv`	Path to a csv file with the target mapping. One column must be named "file"; the other entries in each row are the targets.
`max_samples`	Maximum number of samples to use from one file. If not `NULL` and file has more than `max_samples` samples, will randomly choose a subset of `max_samples` samples.
`proportion_per_seq`	Numerical value between 0 and 1. Proportion of sequence to take samples from (use random subsequence).
`concat_seq`	Character string or `NULL`. If not `NULL`, all entries from a file are concatenated into one sequence, with the `concat_seq` string between them. Example: if the first entry is AACC, the second entry is TTTG and `concat_seq = "ZZZ"`, the result is AACCZZZTTTG.
`seed`	Sets seed for `set.seed` function for reproducible results.
`auc`	Whether to include AUC metric. If output layer activation is `"softmax"`, only possible for 2 targets. Computes the average if output layer has sigmoid activation and multiple targets.
`auprc`	Whether to include AUPRC metric. If output layer activation is `"softmax"`, only possible for 2 targets. Computes the average if output layer has sigmoid activation and multiple targets.
`path_pred_list`	Path to store list of predictions (output of output layers) and corresponding true labels as rds file.
`exact_num_samples`	Exact number of samples to evaluate. Use this to evaluate a number of samples that is not divisible by `batch_size`, for example to evaluate a data set exactly once when the number of samples is already known. Should be a vector if `mode = "label_folder"` (with the same length as `vocabulary_label`) and an integer otherwise.
`activations`	List containing the activation functions of the output layers (`softmax`, `sigmoid` or `linear`). If `NULL`, will be estimated from the model.
`shuffle_file_order`	Logical, whether to go through files randomly or sequentially.
`include_seq`	Whether to store input. Only applies if `path_pred_list` is not `NULL`.
`...`	Further generator options. See `get_generator`.
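
The `"AACCGTA"` example given for `output_format` can be reproduced with a few lines of base R. This is only a sketch of how the sequence is split into input and target: `split_for_output_format()` is a hypothetical helper that is not part of deepG, and it assumes a sequence of odd length for the two `target_middle_*` formats.

```r
# Illustrative only: split one sequence into input (X) and target (Y)
# for each `output_format`. Hypothetical helper, not a deepG function.
split_for_output_format <- function(sequence, output_format) {
  chars <- strsplit(sequence, "")[[1]]
  n <- length(chars)
  mid <- ceiling(n / 2)  # middle position (assumes odd sequence length)
  switch(output_format,
    target_right = list(
      X = paste(chars[-n], collapse = ""),
      Y = chars[n]
    ),
    target_middle_lstm = list(
      X_1 = paste(chars[seq_len(mid - 1)], collapse = ""),
      # the right-hand context is passed in reversed order
      X_2 = paste(rev(chars[(mid + 1):n]), collapse = ""),
      Y = chars[mid]
    ),
    target_middle_cnn = list(
      X = paste(chars[-mid], collapse = ""),
      Y = chars[mid]
    ),
    wavenet = list(
      X = paste(chars[-n], collapse = ""),
      Y = paste(chars[-1], collapse = "")
    ),
    stop("unknown output_format: ", output_format)
  )
}

str(split_for_output_format("AACCGTA", "target_middle_lstm"))
```

Calling the helper with each of the four format names returns the X and Y strings listed in the `output_format` description.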

Value

A list of evaluation results. Each list element corresponds to an output layer of the model.

Examples


library(deepG)

# create dummy data: 3 files, each with 5 sequences of length 11
path_input <- tempfile()
dir.create(path_input)
create_dummy_data(file_path = path_input,
                  num_files = 3,
                  seq_length = 11,
                  num_seq = 5,
                  vocabulary = c("a", "c", "g", "t"))
# create model with input length 10
model <- create_model_lstm_cnn(layer_lstm = 8, layer_dense = 4, maxlen = 10, verbose = FALSE)
# evaluate as language model, iterating over all files once
evaluate_model(path_input = path_input,
  model = model,
  step = 11,
  vocabulary = c("a", "c", "g", "t"),
  vocabulary_label = list(c("a", "c", "g", "t")),
  mode = "lm",
  output_format = "target_right",
  evaluate_all_files = TRUE,
  verbose = FALSE)
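
The choice of `seq_length = 11`, `maxlen = 10` and `step = 11` in the example above can be checked with a short base R sketch. It assumes that, for `output_format = "target_right"`, each sample covers `maxlen + 1` characters (the input plus one target character) and that samples start every `step` positions from the beginning of a sequence. `sample_starts()` is a hypothetical helper, not a deepG function, and the generator's actual sampling may differ in detail.

```r
# Hypothetical helper (not part of deepG): start positions of the samples
# a sliding window would take from a single sequence.
sample_starts <- function(seq_length, maxlen, step) {
  window <- maxlen + 1  # assumption: maxlen input characters plus one target
  if (seq_length < window) {
    return(numeric(0))  # sequence too short for a single sample
  }
  seq(from = 1, to = seq_length - window + 1, by = step)
}

# Values from the example: one sample per sequence
starts <- sample_starts(seq_length = 11, maxlen = 10, step = 11)
samples_per_seq <- length(starts)

# 3 files with 5 sequences each
total_samples <- 3 * 5 * samples_per_seq
print(total_samples)
```

Under these assumptions, every dummy sequence yields exactly one sample, so a single pass over all files evaluates 15 samples.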

GenomeNet/deepG documentation built on Jan. 25, 2025, 12:05 a.m.