encode_onehot: Perform one-hot encoding of any character string

View source: R/general_utils.R

Description

Given any sequence, return a data.frame in which every column corresponds to one letter of the input string and every row corresponds to one letter of alphabet_order. A cell is 1 where the column letter matches the row letter and 0 otherwise. For example, for DNA each nucleotide is represented as a vector of length 4 in which exactly one position, determined by the nucleotide, is 1 and the other three are 0.

Usage

encode_onehot(string, alphabet_order = c("A", "T", "C", "G"))

Arguments

string

A string of letters to be one-hot encoded. Case-sensitive!

alphabet_order

A character vector of single letters specifying the alphabet order. It must contain every letter present in string. If missing, the unique letters of string sorted alphabetically are used. Default is the DNA alphabet: c('A', 'T', 'C', 'G').

Details

If alphabet_order is not specified, the function builds one on the fly by sorting the unique letters of string alphabetically. If string contains a letter that is not present in alphabet_order, the function stops with an error. Special characters and numbers are allowed (see Examples).
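The behaviour described above can be approximated with a few lines of base R. This is a minimal sketch for illustration, not the package source: the function name onehot_sketch and the use of NULL to signal a missing alphabet are assumptions made here.

```r
# Hypothetical re-implementation for illustration only (not the niar source).
onehot_sketch <- function(string, alphabet_order = NULL) {
  # Split the input into single characters
  letters_in <- strsplit(string, "")[[1]]

  # Assumption: with no alphabet given, use the sorted unique input letters
  if (is.null(alphabet_order)) {
    alphabet_order <- sort(unique(letters_in))
  }

  # Stop if the input contains a letter that the alphabet does not cover
  unknown <- setdiff(letters_in, alphabet_order)
  if (length(unknown) > 0) {
    stop("Letters not found in alphabet_order: ",
         paste(unknown, collapse = ", "))
  }

  # Compare every alphabet letter (rows) with every input letter (columns);
  # multiplying by 1L turns TRUE/FALSE into 1/0
  mat <- outer(alphabet_order, letters_in, FUN = "==") * 1L
  dimnames(mat) <- list(alphabet_order, letters_in)

  # as.data.frame() on a matrix keeps repeated column names unchanged
  as.data.frame(mat)
}

onehot_sketch("UUUAAACCCGGG", alphabet_order = c("A", "U", "C", "G"))
```

Building the table with a single outer() comparison avoids an explicit loop over positions; the row order then follows alphabet_order directly.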

Value

A data.frame with one row per letter of alphabet_order and one column per letter of string.
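Because each column holds exactly one 1, the returned table can be turned back into the original string. The sketch below shows one way to do that, assuming the layout documented here; the object onehot is built by hand to mirror the 'ATHCAY' example rather than produced by the package.

```r
# Hand-built copy of the documented output for 'ATHCAY' (illustrative data).
onehot <- as.data.frame(matrix(
  c(1, 0, 0, 0, 1, 0,
    0, 0, 0, 1, 0, 0,
    0, 0, 1, 0, 0, 0,
    0, 1, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("A", "C", "H", "T", "Y"),
                  c("A", "T", "H", "C", "A", "Y"))
))

# Work on a numeric matrix: rows = alphabet, columns = string positions
m <- as.matrix(onehot)

# For each column, find the row holding the 1 and look up its letter
decoded <- paste(rownames(m)[apply(m, 2, which.max)], collapse = "")
decoded
# [1] "ATHCAY"
```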

Author(s)

Niccolò Arecco

Examples

# RNA example
encode_onehot(string = "UUUAAACCCGGG", alphabet_order = c('A', 'U', 'C', 'G'))
# Returns the following data.frame where the rows are ordered as in
# alphabet_order and the columns are ordered as in the input string.
  U U U A A A C C C G G G
A 0 0 0 1 1 1 0 0 0 0 0 0
U 1 1 1 0 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 1 1 1 0 0 0
G 0 0 0 0 0 0 0 0 0 1 1 1

# Alphabet is optional 
encode_onehot(string = 'ATHCAY')
# Returns the following data.frame where the input string was sorted 
# alphabetically to generate the order on the rows
  A T H C A Y
A 1 0 0 0 1 0
C 0 0 0 1 0 0
H 0 0 1 0 0 0
T 0 1 0 0 0 0
Y 0 0 0 0 0 1

# Case sensitive input
encode_onehot(string = 'acACCnN')
# Returns the following data.frame where each lower-case letter appears before its upper-case counterpart
  a c A C C n N
a 1 0 0 0 0 0 0
A 0 0 1 0 0 0 0
c 0 1 0 0 0 0 0
C 0 0 0 1 1 0 0
n 0 0 0 0 0 1 0
N 0 0 0 0 0 0 1

# Special characters and numbers are encoded just fine
encode_onehot(string = 'MaQ8T!S-K C2C*')
# Returns a data.frame where symbols are sorted first. 
# Note how the space (' ') is both a row name and column name
  M a Q 8 T ! S - K   C 2 C *
  0 0 0 0 0 0 0 0 0 1 0 0 0 0
- 0 0 0 0 0 0 0 1 0 0 0 0 0 0
! 0 0 0 0 0 1 0 0 0 0 0 0 0 0
* 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2 0 0 0 0 0 0 0 0 0 0 0 1 0 0
8 0 0 0 1 0 0 0 0 0 0 0 0 0 0
a 0 1 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 1 0 1 0
K 0 0 0 0 0 0 0 0 1 0 0 0 0 0
M 1 0 0 0 0 0 0 0 0 0 0 0 0 0
Q 0 0 1 0 0 0 0 0 0 0 0 0 0 0
S 0 0 0 0 0 0 1 0 0 0 0 0 0 0
T 0 0 0 0 1 0 0 0 0 0 0 0 0 0


Ni-Ar/niar documentation built on Feb. 3, 2025, 9:25 a.m.