prepare_data_from_FASTA: Generate one-hot encoding of sequences given as a FASTA file

View source: R/prepare_data_from_FASTA.R

Description

Given a set of sequences in a FASTA file, this function returns a sparse matrix of one-hot encoded sequences. In this matrix, sequence features run along the rows and sequences along the columns. Currently, mono- and dinucleotide features for DNA sequences are supported. Because the DNA alphabet has four characters, the length of the feature vector is 4 times the sequence length for mononucleotide features and 16 times the sequence length for dinucleotide features.
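
To illustrate the layout described above, here is a minimal base-R sketch of mononucleotide one-hot encoding. The helper `one_hot_sketch` is hypothetical and is not part of archR. It returns a dense matrix rather than a sparse one, and its row ordering (four indicator rows per position) is an arbitrary choice for illustration that may differ from the ordering archR uses.

```r
# Hypothetical helper (not part of archR): one-hot encode equal-length
# DNA sequences given as a character vector, using base R only.
one_hot_sketch <- function(seqs) {
    seq_length <- unique(nchar(seqs))
    # All sequences must have the same length
    stopifnot(length(seq_length) == 1)
    bases <- c("A", "C", "G", "T")
    encode_one <- function(s) {
        chars <- strsplit(s, "", fixed = TRUE)[[1]]
        # 4 x L indicator matrix, flattened column-wise into a vector
        as.integer(outer(bases, chars, "=="))
    }
    # Features along rows, sequences along columns
    vapply(seqs, encode_one, integer(4 * seq_length), USE.NAMES = FALSE)
}

seqs <- c("ACGT", "TTAG")
m <- one_hot_sketch(seqs)
dim(m)      # 4 * 4 feature rows, one column per sequence
colSums(m)  # one non-zero entry per sequence position
```

Each column holds a single 1 per sequence position, which is why the result is mostly zeros and why a sparse matrix is a natural representation for it.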

Usage

prepare_data_from_FASTA(fasta_fname, raw_seq = FALSE, sinuc_or_dinuc = "sinuc")

Arguments

fasta_fname

Name of the input FASTA file, including the complete path.

raw_seq

Logical (TRUE or FALSE). Set this to TRUE to get the raw sequences rather than their one-hot encoding. Defaults to FALSE.

sinuc_or_dinuc

Character string, either 'sinuc' or 'dinuc', selecting mono- or dinucleotide profiles respectively. Defaults to 'sinuc'.

Value

A sparse matrix of sequences represented with one-hot encoding. If raw_seq is TRUE, the sequences are returned as a Biostrings::DNAStringSet object instead.

See Also

get_one_hot_encoded_seqs for directly using a DNAStringSet object

Other input functions: get_one_hot_encoded_seqs()

Examples

fname <- system.file("extdata", "example_data.fa",
                     package = "archR", mustWork = TRUE)

# mononucleotide feature matrix
sinucMat <- prepare_data_from_FASTA(fasta_fname = fname,
                                    sinuc_or_dinuc = "sinuc")

# dinucleotide feature matrix
dinucMat <- prepare_data_from_FASTA(fasta_fname = fname,
                                    sinuc_or_dinuc = "dinuc")

# FASTA sequences as a Biostrings::DNAStringSet object
rawSeqs <- prepare_data_from_FASTA(fasta_fname = fname,
                                   raw_seq = TRUE)

snikumbh/archR documentation built on July 5, 2021, 8:46 a.m.