df2bits: Calculate the information content expressed in bits of...

View source: R/pwm_utils.R

df2bits    R Documentation

Calculate the information content expressed in bits of sequences stored in a data.frame

Description

This function calculates the information content, expressed in bits, using the Shannon entropy. Check the Details section for the full explanation and formulas. LaTeX syntax for subscripts and fractions is currently not supported here; to display the formulas properly, one could copy-paste the Details section into Overleaf.

Usage

df2bits(
  data,
  ID_col,
  alphabet,
  small_n_correction = FALSE,
  long_format = FALSE,
  ignore_case = FALSE
)

Arguments

data

A data.frame with at least 2 columns: one named Sequence, and an identifier column with a name of your choice, specified with ID_col.

ID_col

The name of the column in data to be used as the identifier of the Sequence column.

alphabet

A character vector containing the alphabet letters present in Sequence. Guessed by default.

small_n_correction

Logical. If TRUE, apply the small-sample correction to the Shannon entropy. See Details. Default FALSE.

long_format

Logical. If TRUE reshape the bits into a tidy long data.frame format. Default FALSE.

ignore_case

Logical. If TRUE, the length of the alphabet is calculated ignoring letter case, meaning that the maximum bits height will be calculated from the case-insensitive length of the alphabet. See Note for more explanation. Default FALSE.

Details

Consider an alphabet of W letters, where l denotes a letter of the alphabet. For example, the DNA alphabet is {A, C, G, T}, so W = 4. For a multiple sequence alignment of N sequences of length I, the information content, expressed in bits, of letter l at position i is denoted bits_{l,i} and defined as

bits_{l,i} = R(l,i) \times ( \log_2(W) - (H_i + \epsilon) )

where H_i is the Shannon entropy, representing the uncertainty at position i, defined as

H_i = -\sum_{l = 1}^{W} p_{l,i} \times \log_2 p_{l,i}

where p_{l,i} is the relative frequency (a.k.a. probability) of letter l at position i; \epsilon is the small-sample correction for an alignment of N sequences (see small_n_correction), defined as

\epsilon = \frac{1}{\log_e 2} \times \frac{W-1}{2N}

and R(l,i) is the position probability matrix of the sequences, containing the p_{l,i} computed over the N sequences.
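To make the formulas concrete, here is a minimal base-R sketch that computes the bits by hand for a toy alignment. It is not the package's implementation; the toy sequences and the names seqs, letters_mat, ppm, H, info and bits are assumptions made up for illustration. Terms with p = 0 are dropped from the entropy sum, following the usual convention that 0 * log2(0) = 0.

```r
# Hypothetical toy alignment: N = 4 DNA sequences of length I = 3.
seqs <- c("ACG", "ACG", "ACT", "AGT")
alphabet <- c("A", "C", "G", "T")
W <- length(alphabet)
N <- length(seqs)

# Split each sequence into single letters: an N x I character matrix.
letters_mat <- do.call(rbind, strsplit(seqs, ""))

# Position probability matrix R(l, i): rows are letters, columns are positions.
ppm <- vapply(seq_len(ncol(letters_mat)), function(i) {
  as.numeric(table(factor(letters_mat[, i], levels = alphabet))) / N
}, numeric(W))
rownames(ppm) <- alphabet

# Shannon entropy H_i of each position (zero-probability letters are skipped).
H <- apply(ppm, 2, function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
})

# Small-sample correction epsilon; set to 0 when the correction is switched off.
small_n_correction <- FALSE
epsilon <- if (small_n_correction) (1 / log(2)) * (W - 1) / (2 * N) else 0

# Information content of each position, then bits of each letter at each position.
info <- log2(W) - (H + epsilon)
bits <- sweep(ppm, 2, info, `*`)

bits["A", 1]  # position 1 is always 'A', so it carries log2(4) = 2 bits
```

In this toy case position 3 is split evenly between G and T, so H_3 = 1 and each of the two letters gets 0.5 bits.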

Value

A data.frame or a tidy long format data.frame

Note

When working with DNA sequences in both upper and lower case, with an alphabet that has both 'ATGC' and 'atgc', one can force the maximum information content to log2(4) instead of log2(8) by setting ignore_case = TRUE.
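The effect on the maximum information content can be checked with a short base-R sketch. It only illustrates the arithmetic; how df2bits handles case internally is not shown here, and the variable names are assumptions made up for illustration.

```r
# Hypothetical mixed-case DNA alphabet.
alphabet <- c("A", "C", "G", "T", "a", "c", "g", "t")

# Case-sensitive alphabet length: 8 letters.
W_sensitive <- length(alphabet)

# Case-insensitive alphabet length: 4 letters.
W_insensitive <- length(unique(toupper(alphabet)))

log2(W_sensitive)    # 3 bits maximum
log2(W_insensitive)  # 2 bits maximum
```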

Examples

# 'data' must contain a 'Sequence' column and the ID column named in ID_col
df2bits(data, ID_col = 'Species',
        alphabet = c('a', 'c', 'g', 't'),
        small_n_correction = FALSE,
        long_format = TRUE)

Ni-Ar/niar documentation built on Feb. 3, 2025, 9:25 a.m.