n_gram_dist: Get distribution of n-grams
In GenomeNet/deepG: Deep Learning for Genome Sequence Data

Description

Computes the distribution of the next character given the previous n nucleotides.

Usage

n_gram_dist(
  path_input,
  n = 2,
  vocabulary = c("A", "C", "G", "T"),
  format = "fasta",
  file_sample = NULL,
  step = 1,
  nuc_dist = FALSE
)

Arguments

`path_input`	Path to a folder containing fasta files, or to a single fasta file.
`n`	Size of the n-gram.
`vocabulary`	Vector of allowed characters; samples containing characters outside the vocabulary are discarded.
`format`	File format, either `"fasta"` or `"fastq"`.
`file_sample`	If an integer, the size of the random sample of files drawn from `path_input`.
`step`	Distance between the start positions of consecutive samples.
`nuc_dist`	Nucleotide distribution.

Value

Returns a matrix with distributions of nucleotides given the previous n nucleotides.

A data frame of n-gram predictions.
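
To make the idea of a next-character distribution concrete, here is a minimal base R sketch that estimates it from a single string. It is not deepG's implementation: the helper name `toy_n_gram_dist` is hypothetical, and the choice to drop every window containing a character outside the vocabulary is an assumption made for illustration.

```r
# Hypothetical helper, not part of deepG: estimates
# P(next character | previous n characters) from one sequence string.
toy_n_gram_dist <- function(seq_string, n = 2,
                            vocabulary = c("A", "C", "G", "T")) {
  chars <- strsplit(seq_string, "")[[1]]
  stopifnot(length(chars) > n)

  # Each window holds n context characters followed by one target character.
  starts <- seq_len(length(chars) - n)
  windows <- lapply(starts, function(i) chars[i:(i + n)])

  # Assumption: windows with any out-of-vocabulary character are dropped.
  keep <- vapply(windows, function(w) all(w %in% vocabulary), logical(1))
  windows <- windows[keep]

  context <- vapply(windows, function(w) paste(w[seq_len(n)], collapse = ""),
                    character(1))
  target <- vapply(windows, function(w) w[n + 1], character(1))

  # Count (context, next character) pairs, then normalise each row to sum to 1.
  counts <- table(context = context,
                  next_char = factor(target, levels = vocabulary))
  prop.table(counts, margin = 1)
}

m_toy <- toy_n_gram_dist("ACGTACGTACGA", n = 2)
round(m_toy, 2)
```

In this sketch each row corresponds to one observed context of length `n`, and each column to a possible next character. The actual layout of the object returned by `n_gram_dist` may differ.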

Examples

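library(deepG)

# Create a temporary folder and fill it with dummy fasta files
temp_dir <- tempfile()
dir.create(temp_dir)
create_dummy_data(file_path = temp_dir,
                  num_files = 3,
                  seq_length = 80,
                  vocabulary = c("A", "C", "G", "T"),
                  num_seq = 2)

# Distribution of the next nucleotide given the previous 3 nucleotides
m <- n_gram_dist(path_input = temp_dir,
                 n = 3,
                 step = 1,
                 nuc_dist = FALSE)

# Show the first rows, rounded to two decimals
head(round(m, 2))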

GenomeNet/deepG documentation built on Jan. 25, 2025, 12:05 a.m.