predict_with_n_gram: Predict the next nucleotide using n-gram
In GenomeNet/deepG: Deep Learning for Genome Sequence Data

Description

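Predicts the next nucleotide in a sequence from the previous n nucleotides, using the n-gram frequencies in a distribution matrix produced by `n_gram_dist`, and evaluates the predictions against the true next nucleotides.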
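
To make the idea of n-gram prediction concrete, here is a minimal base-R sketch. It is not deepG's implementation: `count_next()` and `predict_next()` are hypothetical helpers written for illustration, and the count table they build is only assumed to be similar in spirit to the output of `n_gram_dist`.

```r
# Illustrative base-R sketch; not deepG's implementation.
# Count how often each nucleotide follows each n-gram in a training sequence.
count_next <- function(sequence, n, vocabulary = c("A", "C", "G", "T")) {
  chars <- strsplit(sequence, "")[[1]]
  starts <- seq_len(max(0, length(chars) - n))
  contexts <- vapply(
    starts,
    function(i) paste(chars[i:(i + n - 1)], collapse = ""),
    character(1)
  )
  targets <- factor(chars[starts + n], levels = vocabulary)
  table(context = contexts, target = targets)
}

# Predict the most frequent follower of a context; NA if the context is unseen.
predict_next <- function(counts, context) {
  if (!context %in% rownames(counts)) return(NA_character_)
  colnames(counts)[which.max(counts[context, ])]
}

counts <- count_next("ACGTACGAACGT", n = 2)
predict_next(counts, "AC")  # "G": "AC" is followed by "G" every time
predict_next(counts, "TT")  # NA: "TT" never occurs in the training sequence
```

The `NA` case is the situation that `default_pred` addresses in `predict_with_n_gram`: an n-gram with no recorded followers needs a fallback prediction.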

Usage

predict_with_n_gram(
  path_input,
  distribution_matrix,
  default_pred = "random",
  vocabulary = c("A", "C", "G", "T"),
  file_sample = NULL,
  format = "fasta",
  return_data_frames = FALSE,
  step = 1
)

Arguments

`path_input`	Path to a folder containing fasta files, or to a single fasta file.
`distribution_matrix`	A data frame containing the frequency of each next nucleotide given the previous n nucleotides (output of the `n_gram_dist` function).
`default_pred`	Either a character from `vocabulary` or `"random"`. Used as the prediction for any n-gram that did not appear before. If `"random"`, a random prediction is assigned.
`vocabulary`	Vector of allowed characters; samples containing characters outside the vocabulary are discarded.
`file_sample`	If an integer, the size of a random sample of files drawn from `path_input`.
`format`	File format, either `"fasta"` or `"fastq"`.
`return_data_frames`	Boolean; whether to return a data frame with input, predictions, target position and true target.
`step`	How often to take a sample, i.e. the step size between consecutive samples.
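
The interplay of `step`, `vocabulary` and `default_pred` can be sketched in base R. This is an illustration of the semantics described above under stated assumptions, not the package's code: `make_samples()` and `fallback()` are hypothetical helpers, and it is assumed that `step` is the distance between the start positions of consecutive samples and that a sample is dropped if its input or its target contains a character outside the vocabulary.

```r
# Illustrative sketch; make_samples() and fallback() are hypothetical helpers,
# not deepG functions.
make_samples <- function(sequence, n, step = 1,
                         vocabulary = c("A", "C", "G", "T")) {
  chars <- strsplit(sequence, "")[[1]]
  last_start <- length(chars) - n
  if (last_start < 1) {
    return(data.frame(input = character(0), target = character(0)))
  }
  # Assumption: one sample every `step` positions.
  starts <- seq(1, last_start, by = step)
  windows <- lapply(starts, function(i) chars[i:(i + n)])
  # Discard windows containing characters outside the vocabulary.
  keep <- vapply(windows, function(w) all(w %in% vocabulary), logical(1))
  windows <- windows[keep]
  data.frame(
    input = vapply(windows, function(w) paste(w[seq_len(n)], collapse = ""),
                   character(1)),
    target = vapply(windows, function(w) w[n + 1], character(1)),
    stringsAsFactors = FALSE
  )
}

# Fallback prediction for an n-gram without recorded followers.
fallback <- function(default_pred = "random",
                     vocabulary = c("A", "C", "G", "T")) {
  if (identical(default_pred, "random")) sample(vocabulary, 1) else default_pred
}

s1 <- make_samples("ACGNTACGT", n = 2, step = 1)  # 4 samples; windows with "N" dropped
s3 <- make_samples("ACGNTACGT", n = 2, step = 3)  # 2 samples; starts at positions 1, 4, 7
fallback("A")                                     # "A"
```

A larger `step` produces fewer samples and therefore a faster but coarser evaluation.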

Value

A list of prediction evaluations. In the example below, the first element is the accuracy.
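
This page does not spell out the remaining list elements. As a rough guide to what a prediction evaluation can involve, the base-R sketch below computes an accuracy and a confusion table from made-up vectors; the data and variable names are hypothetical, and the structure is not claimed to match the list that `predict_with_n_gram` returns.

```r
# Hypothetical predicted and true next nucleotides (made-up data).
vocabulary <- c("A", "C", "G", "T")
predicted <- c("A", "C", "G", "T", "A")
truth     <- c("A", "C", "T", "T", "G")

# Accuracy: share of positions where the prediction equals the true target.
accuracy <- mean(predicted == truth)  # 0.6 (3 of 5 correct)

# Confusion table: rows are predictions, columns are true nucleotides.
confusion <- table(
  predicted = factor(predicted, levels = vocabulary),
  truth = factor(truth, levels = vocabulary)
)
confusion
```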

Examples

library(deepG)

# create dummy fasta files
temp_dir <- tempfile()
dir.create(temp_dir)
create_dummy_data(file_path = temp_dir,
                  num_files = 3,
                  seq_length = 8,
                  vocabulary = c("A", "C", "G", "T"),
                  num_seq = 2)

m <- n_gram_dist(path_input = temp_dir,
                 n = 3,
                 step = 1,
                 nuc_dist = FALSE)

# use distribution matrix to make predictions for one file
first_file <- list.files(temp_dir, full.names = TRUE)[1]
predictions <- predict_with_n_gram(path_input = first_file,
                                   distribution_matrix = m)

# show accuracy
predictions[[1]]

GenomeNet/deepG documentation built on Jan. 25, 2025, 12:05 a.m.