README.md

Tool Mania

A chaotic bunch of tools. This package serves mostly as an incubator for tools and scripts. Once something reaches critical mass, it will be split out into its own package.

The main reasons for the existence of this package are:

* Decrease the startup time for beginning a new tool by having existing infrastructure.
* Encourage code re-use by putting each new tool in a context where there will already be lots of data manipulation functions available.
* Rapidly deploy code in pre-alpha version to collaborators.

Currently there is only one toolset being developed in this package:

* pentropy - positional entropy tools for sets of longitudinal sequences.

pentropy

Read in a single FASTA file (DNA or AA) with aligned sequences at different time points for a patient. Then compute the entropy at each position at each time point and plot this information. A planned extension is to work with the actual AA / nucleotide distribution at each position instead of the entropies.
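
The per-position entropy computation can be sketched in base R as follows. This is only an illustration, not the package's implementation: the helper name `position_entropy` is hypothetical, Shannon entropy in bits is assumed as the entropy measure, and the sequences are passed as a plain character vector rather than a DNAStringSet.

```r
# Hypothetical helper: Shannon entropy (in bits) of each alignment column.
# `seqs` is a character vector of aligned sequences of equal length.
# Gap characters are counted as just another symbol in this sketch.
position_entropy <- function(seqs) {
  stopifnot(length(unique(nchar(seqs))) == 1)
  # One row per sequence, one column per alignment position
  chars <- do.call(rbind, strsplit(seqs, ""))
  apply(chars, 2, function(column) {
    p <- table(column) / length(column)
    -sum(p * log2(p))
  })
}

# Made-up sequences for one time point
seqs <- c("ACGT", "ACGA", "ATGC", "ACGG")
entropy <- position_entropy(seqs)
# Positions 1 and 3 are fully conserved (entropy 0);
# position 4 has four distinct residues (entropy 2 bits).
```

Running this once per time point gives the per-time-point values that the rest of this document stores and plots.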

Notes on data structures

First version of data structures

The FASTA files are read into DNAStringSets (or AAStringSets). The name of each sequence must be specifically formatted so that it is easy to extract only the data at the different time points. This is not a fixed format at the moment - will re-evaluate it once the new data arrives.
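
Since the name format is not fixed yet, the following is only a sketch of how time points might be pulled out of sequence names. The format `<patient>_<time point>_<sequence id>` is purely an assumption made for illustration; with a DNAStringSet, the names would come from `names()` on the set instead of a hand-written vector.

```r
# ASSUMED (hypothetical) name format: "<patient>_<time point>_<sequence id>"
seq_names <- c("p01_t00_s1", "p01_t00_s2", "p01_t12_s1")

# Take the second underscore-separated field as the time point
time_point <- vapply(
  strsplit(seq_names, "_", fixed = TRUE),
  function(parts) parts[2],
  character(1)
)

# Group the sequence names by time point
names_by_time_point <- split(seq_names, time_point)
```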

The entropy data for each time point is stored in a data.frame with the columns:

* pos
* entropy

This is extended to include multiple time points by just adding a column:

* time_point
* pos
* entropy
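
As a rough illustration of these two layouts, the per-time-point data.frames can be stacked into the multi-time-point form like this. The object names and the entropy values are made up for the example.

```r
# Illustrative values only; not real data.
entropy_t1 <- data.frame(pos = 1:3, entropy = c(0.00, 0.81, 1.50))
entropy_t2 <- data.frame(pos = 1:3, entropy = c(0.00, 1.00, 1.58))

# Prepend a time_point column to each frame, then stack them
all_entropy <- rbind(
  cbind(time_point = "t1", entropy_t1),
  cbind(time_point = "t2", entropy_t2)
)
str(all_entropy)
```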

ggplot can easily use these input formats to produce plots, and they are also easy to feed to R's clustering algorithms.
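
A small sketch of both uses, with made-up values. The clustering step uses `hclust` from base R's stats package as one possible choice; the choice of algorithm and of two clusters is an assumption for the example. The plotting call is left as a comment because it needs ggplot2 to be installed.

```r
# Illustrative values only; not real data.
all_entropy <- data.frame(
  time_point = rep(c("t1", "t2"), each = 4),
  pos        = rep(1:4, times = 2),
  entropy    = c(0.0, 0.8, 1.5, 0.1, 0.0, 1.0, 1.6, 0.2)
)

# Plotting (requires ggplot2):
# ggplot2::ggplot(all_entropy,
#                 ggplot2::aes(pos, entropy, colour = time_point)) +
#   ggplot2::geom_line()

# Clustering: positions as rows, time points as columns
wide <- tapply(all_entropy$entropy,
               list(all_entropy$pos, all_entropy$time_point),
               mean)
clusters <- cutree(hclust(dist(wide)), k = 2)
# The low-entropy positions (1 and 4) end up in one cluster,
# the high-entropy positions (2 and 3) in the other.
```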

Extension of these data structures

An additional data structure will be required in which pos is collapsed out. This is unavoidable.

To accommodate more metrics or the possibility of tracking the exact AA/nucleotide distribution at each time point, two extensions are possible:

* add more columns
* add a column called 'variable' and change the current entropy column to 'value'

The second option is the winner since it just scales better. However, performance might become an issue. Will have to keep an eye on this and see if more efficient data structures might be needed.
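
To make the difference between the two options concrete, here is a base R sketch that turns the "more columns" layout into the 'variable' / 'value' layout. The `variance` column and all values are made up for illustration.

```r
# Option 1: one column per metric (illustrative values only)
wide <- data.frame(
  time_point = "t1",
  pos        = 1:2,
  entropy    = c(0.00, 0.81),
  variance   = c(0.00, 0.19)
)

# Option 2: one row per (time_point, pos, metric)
metric_cols <- setdiff(names(wide), c("time_point", "pos"))
long <- do.call(rbind, lapply(metric_cols, function(metric) {
  data.frame(
    time_point = wide$time_point,
    pos        = wide$pos,
    variable   = metric,
    value      = wide[[metric]]
  )
}))
```

With the second layout a new metric only adds rows, so the set of columns never changes. The price is one row per metric per position, which is where the performance concern comes from. Packages such as tidyr (`pivot_longer`) can do the same reshaping.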

These extensions will be handled by this approach:

* Add more metrics (different entropies, variance, ...)
* Add the nucleotide / AA frequency counts (what about percentages?)

Some situations in which the positions must be collapsed out:

* Loop lengths
* Glycosylation sites
* Charge
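
As one possible illustration of a metric without a position, the sketch below counts potential N-linked glycosylation sites per sequence, using the common N-X-S/T sequon pattern (X being any residue except proline). The helper `count_sequons`, the name `no_pos_data`, the sequence name format and the sequences themselves are all hypothetical.

```r
# Hypothetical helper: count N-X-[ST] sequons (X != P) in one AA sequence.
# Alignment gaps are removed first; the lookahead lets sequons overlap.
count_sequons <- function(aa) {
  ungapped <- gsub("-", "", aa, fixed = TRUE)
  hits <- gregexpr("N(?=[^P][ST])", ungapped, perl = TRUE)[[1]]
  sum(hits > 0)
}

# Made-up aligned sequences; names assumed to start with the time point
seqs <- c(t1_s1 = "NGSA-NPT", t1_s2 = "NGSANNT-", t2_s1 = "NGTANNS-")

# One row per sequence, with no pos column
no_pos_data <- data.frame(
  time_point = sub("_.*$", "", names(seqs)),
  variable   = "glycosylation_sites",
  value      = unname(vapply(seqs, count_sequons, integer(1)))
)
```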

Current design of data structures

The following data structures are used:

* one_pos_data (data for positions at one point, usually a time point)
    * pos
    * variable
    * value
* all_pos_data (all data for all time points)
    * time_point
    * pos
    * variable
    * value
* aggre_pos_data (aggregated data over all points for each position)
    * pos
    * variable
    * value
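
The relationship between the three structures can be sketched in base R as below. The values are made up, and using the mean as the aggregation function is an assumption, since the aggregation is not specified here.

```r
# all_pos_data: all data for all time points (illustrative values only)
all_pos_data <- data.frame(
  time_point = rep(c("t1", "t2"), each = 2),
  pos        = rep(1:2, times = 2),
  variable   = "entropy",
  value      = c(0.0, 0.8, 0.2, 1.0)
)

# one_pos_data: the slice for a single time point, without time_point
one_pos_data <- all_pos_data[all_pos_data$time_point == "t1",
                             c("pos", "variable", "value")]

# aggre_pos_data: one row per position and variable, aggregated over
# all time points (mean is an assumed choice of aggregation)
aggre_pos_data <- aggregate(value ~ pos + variable,
                            data = all_pos_data, FUN = mean)
```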



philliplab/toolmania documentation built on May 25, 2019, 5:06 a.m.