View source: R/general_utils.R
encode_onehot | R Documentation |
Given any sequence, returns a data.frame in which every column corresponds to one
letter of the input string and every row corresponds to one letter of
alphabet_order. A cell is 1 when its column letter equals its row letter and 0 otherwise.
For example, in the case of DNA each nucleotide will be represented as a vector of
length 4, where 3 positions are 0 and only one position is 1, depending on the nucleotide.
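The encoding described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: onehot_sketch is a hypothetical name, and the code is not the package's own implementation of encode_onehot.

```r
# Hypothetical base-R sketch of one-hot encoding; NOT the package's own code.
onehot_sketch <- function(string, alphabet_order = c("A", "T", "C", "G")) {
  # Split the input into single characters
  letters_in <- strsplit(string, split = "")[[1]]
  # Compare every alphabet letter (rows) with every input letter (columns);
  # multiplying by 1L turns TRUE/FALSE into 1/0
  mat <- outer(alphabet_order, letters_in, FUN = "==") * 1L
  dimnames(mat) <- list(alphabet_order, letters_in)
  as.data.frame(mat)
}

dna <- onehot_sketch("GATTACA")
# Each column is one nucleotide: a vector of length 4 that sums to exactly 1
colSums(dna)
```

Every column sums to 1 because each input letter matches exactly one row of the alphabet.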
encode_onehot(string, alphabet_order = c("A", "T", "C", "G"))
string | A string of letters to be one-hot encoded. Case-sensitive! |
alphabet_order | A character vector of single letters specifying the alphabet order. It must contain every letter present in string. |
If alphabet_order is not specified, the function creates one on the fly by
sorting the unique letters of the input string alphabetically. If a letter in
string is not present in alphabet_order, the function returns an error.
Special characters and numbers are allowed (see examples).
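The two rules above (a default alphabet built from the input, and an error for letters outside the alphabet) could look roughly like this. The helper name resolve_alphabet, the NULL default, and the error message are assumptions made for illustration; they are not taken from the package source.

```r
# Hypothetical helper illustrating the described behaviour;
# the name, NULL default and error message are assumptions.
resolve_alphabet <- function(string, alphabet_order = NULL) {
  letters_in <- strsplit(string, split = "")[[1]]
  if (is.null(alphabet_order)) {
    # No alphabet supplied: use the distinct letters of the input, sorted
    return(sort(unique(letters_in)))
  }
  # Letters of the input that the supplied alphabet does not cover
  missing_letters <- setdiff(letters_in, alphabet_order)
  if (length(missing_letters) > 0) {
    stop("Letters not in alphabet_order: ",
         paste(missing_letters, collapse = ", "))
  }
  alphabet_order
}

resolve_alphabet("ATHCAY")
resolve_alphabet("ACGT", alphabet_order = c("A", "T", "C", "G"))
# "N" is not in the alphabet, so this call signals an error
try(resolve_alphabet("ACGTN", alphabet_order = c("A", "T", "C", "G")))
```

Note that sort() on character vectors follows the collation rules of the current locale, so the relative order of lower-case letters, upper-case letters and symbols may differ between systems.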
a data.frame
Niccolò Arecco
# RNA example
encode_onehot(string = "UUUAAACCCGGG", alphabet_order = c('A', 'U', 'C', 'G'))
# Returns the following data.frame where the rows are ordered as in the
# alphabet_order and the columns are ordered as the input string.
  U U U A A A C C C G G G
A 0 0 0 1 1 1 0 0 0 0 0 0
U 1 1 1 0 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 1 1 1 0 0 0
G 0 0 0 0 0 0 0 0 0 1 1 1
# Alphabet is optional
encode_onehot(string = 'ATHCAY')
# Returns the following data.frame where the input string was sorted
# alphabetically to generate the order on the rows
  A T H C A Y
A 1 0 0 0 1 0
C 0 0 0 1 0 0
H 0 0 1 0 0 0
T 0 1 0 0 0 0
Y 0 0 0 0 0 1
# Case sensitive input
encode_onehot(string = 'acACCnN')
# Returns the following data.frame, where each lower-case letter is sorted
# before its upper-case counterpart
  a c A C C n N
a 1 0 0 0 0 0 0
A 0 0 1 0 0 0 0
c 0 1 0 0 0 0 0
C 0 0 0 1 1 0 0
n 0 0 0 0 0 1 0
N 0 0 0 0 0 0 1
# Special characters and numbers are encoded just fine
encode_onehot(string = 'MaQ8T!S-K C2C*')
# Returns a data.frame where symbols are sorted first.
# Note how the space (' ') is both a row name and column name
  M a Q 8 T ! S - K   C 2 C *
  0 0 0 0 0 0 0 0 0 1 0 0 0 0
- 0 0 0 0 0 0 0 1 0 0 0 0 0 0
! 0 0 0 0 0 1 0 0 0 0 0 0 0 0
* 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2 0 0 0 0 0 0 0 0 0 0 0 1 0 0
8 0 0 0 1 0 0 0 0 0 0 0 0 0 0
a 0 1 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 1 0 1 0
K 0 0 0 0 0 0 0 0 1 0 0 0 0 0
M 1 0 0 0 0 0 0 0 0 0 0 0 0 0
Q 0 0 1 0 0 0 0 0 0 0 0 0 0 0
S 0 0 0 0 0 0 1 0 0 0 0 0 0 0
T 0 0 0 0 1 0 0 0 0 0 0 0 0 0