README.md

genomeutils

This package provides helper tools that automate common tasks in routine genome analysis.

Fast loading and writing of files and genome objects

These functions use the data.table package for fast file reading and writing.

- Load from a file
- Write to a file
- Read .fasta/.fastq to a Genome object
- Write a Genome object to file
- Read .fasta to a Proteome object
- Write a Proteome object to file
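
As a rough illustration of the data.table calls this kind of fast I/O builds on, here is a minimal sketch. It uses `data.table::fwrite` and `data.table::fread` directly, not this package's own wrappers, and the table contents and temporary file are made up for the example.

```r
library(data.table)

# Hypothetical example table: three genes with made-up read counts.
counts <- data.table(gene    = c("geneA", "geneB", "geneC"),
                     sample1 = c(10L, 0L, 25L),
                     sample2 = c(12L, 3L, 30L))

# Write to a temporary tab-separated file, then read it back.
path <- tempfile(fileext = ".tsv")
fwrite(counts, path, sep = "\t")  # fast write
loaded <- fread(path)             # fast read; separator and column types are detected

unlink(path)                      # clean up the temporary file
```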

Fetch gene attributes and functional analysis

- Fetch the gene length (coding exons) for a list of genes
- Fetch the GC content of the sequence for a list of genes
- Perform Gene Ontology enrichment analysis for a set of interesting genes
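
GC content is the fraction of G and C bases in a sequence. The sketch below shows that calculation in base R for sequences already held as strings. The helper name `gc_content` and the two sequences are hypothetical, and it does not show how the package itself fetches sequences for a gene list.

```r
# Hypothetical helper (not this package's function): GC fraction of DNA strings.
gc_content <- function(seqs) {
  vapply(seqs, function(s) {
    bases <- strsplit(toupper(s), "")[[1]]
    # Keep only unambiguous bases; N and other ambiguity codes are ignored.
    bases <- bases[bases %in% c("A", "C", "G", "T")]
    if (length(bases) == 0) return(NA_real_)
    mean(bases %in% c("G", "C"))
  }, numeric(1))
}

# Made-up sequences for illustration.
seqs <- c(gene1 = "ATGCGC", gene2 = "ATATNN")
gc <- gc_content(seqs)
```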

Normalization and Factorization methods for gene read-count matrices

- Min-max normalization
- Row median/deviation normalization
- Sample-specific normalization
- Upper quartile normalization
- Gene counts to expression in 'Counts per million' (CPM)
- Gene counts to expression in 'Transcripts per million' (TPM)
- Gene counts to expression in 'Relative Log Expression' (RLE), as used in DESeq
- Invertibility of a matrix
- Principal Component Analysis + 2D plots
- Multi-Dimensional Scaling + 2D plots
- Singular Value Decomposition + 2D plots
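
To make three of the count transformations listed above concrete, here is a base R sketch of CPM, TPM, and DESeq-style median-of-ratios size factors. The count matrix and gene lengths are made up, the variable names are illustrative, and the package's own functions may differ in details such as zero-count handling.

```r
# Made-up count matrix: rows are genes, columns are samples.
counts <- matrix(c(10, 20, 70,
                    5,  5, 90),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
gene_length_kb <- c(g1 = 1, g2 = 2, g3 = 7)  # assumed gene lengths in kilobases

# CPM: scale each sample (column) so its counts sum to one million.
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

# TPM: divide each gene (row) by its length first, then scale each sample
# so the length-normalized rates sum to one million.
rate <- counts / gene_length_kb
tpm <- sweep(rate, 2, colSums(rate), "/") * 1e6

# Median-of-ratios size factors: compare each sample to the per-gene
# geometric mean and take the median ratio. This sketch assumes no zero counts.
log_geo_mean <- rowMeans(log(counts))
size_factors <- apply(counts, 2, function(x) exp(median(log(x) - log_geo_mean)))
rle_counts <- sweep(counts, 2, size_factors, "/")
```

CPM corrects only for sequencing depth, while TPM also corrects for gene length, so TPM values are more comparable between genes within a sample.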

Pretty plotting functions

- Modify hierarchical clustering to produce a plot colored by groups
- Produce an MA plot (to identify differentially expressed genes, for instance)
- Produce fancy heatmaps
- Produce a smooth histogram by modifying base R plotting parameters
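
An MA plot shows, for each gene, the log ratio between two conditions (M) against the average log expression (A). The following base R sketch computes both quantities on simulated counts and draws the plot. The data and styling are assumptions for illustration and do not reproduce the package's plotting function.

```r
set.seed(3)
# Simulated counts for 500 genes in two conditions; +1 avoids log2(0).
cond1 <- rpois(500, lambda = 50) + 1
cond2 <- rpois(500, lambda = 50) + 1

M <- log2(cond2) - log2(cond1)        # log fold change
A <- (log2(cond1) + log2(cond2)) / 2  # mean log expression

plot(A, M, pch = 20, col = "grey40",
     xlab = "A (mean log2 expression)", ylab = "M (log2 fold change)")
abline(h = 0, col = "red")            # genes far from this line are candidates
```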

Hypothesis testing for matched groups of data

Comparisons can be ordered by significance.

- T-tests
- F-tests
- Significance of difference of means test
- Significance of difference of variances test
- Wilcoxon tests
- Kolmogorov–Smirnov test
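
The idea of testing matched groups and then ordering the comparisons by significance can be sketched with base R's `t.test` and `wilcox.test`. The data are simulated, the names are hypothetical, and this is not the package's interface.

```r
set.seed(1)
# Simulated matched data: 4 features measured before and after in 8 subjects.
before <- matrix(rnorm(32, mean = 10), nrow = 4,
                 dimnames = list(paste0("feature", 1:4), NULL))
after <- before + matrix(rnorm(32, sd = 0.5), nrow = 4)
after["feature2", ] <- after["feature2", ] + 3  # only feature2 has a real shift

features <- rownames(before)
# Paired tests, one per feature.
p_t <- sapply(features, function(f)
  t.test(before[f, ], after[f, ], paired = TRUE)$p.value)
p_w <- sapply(features, function(f)
  wilcox.test(before[f, ], after[f, ], paired = TRUE)$p.value)

# Collect the results and order the comparisons by t-test significance.
results <- data.frame(feature = features, t_test = p_t, wilcoxon = p_w)
results <- results[order(results$t_test), ]
```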

Maximum likelihood and Bayesian inference

- Maximum likelihood estimation of Gaussian distribution parameters + AIC
- Maximum likelihood estimation of Weibull distribution parameters + AIC
- Implementation of binomial generalized linear model + AIC + BIC
- Bayesian posterior estimation for a mixture of betas
- Bayesian posterior estimation for binomial beta distribution
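
For the Gaussian case, the maximum likelihood estimates have a closed form, and the AIC follows from the maximized log-likelihood as AIC = 2k - 2 log L with k = 2 parameters. The base R sketch below works through this on simulated data. It is an illustration under those assumptions, not the package's implementation, and the Weibull case would instead need numerical optimization.

```r
set.seed(42)
x <- rnorm(200, mean = 5, sd = 2)  # simulated data

# Closed-form maximum likelihood estimates.
mu_hat <- mean(x)
sigma_hat <- sqrt(mean((x - mu_hat)^2))  # divides by n, not n - 1

# Maximized log-likelihood and AIC with k = 2 estimated parameters.
log_lik <- sum(dnorm(x, mean = mu_hat, sd = sigma_hat, log = TRUE))
aic <- 2 * 2 - 2 * log_lik
```

The same AIC is returned by `AIC(lm(x ~ 1))`, which gives a quick cross-check.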

Machine learning

- Support Vector Machine classifier
- Naive Bayes classifier
- Random Forest classifier
- Linear Discriminant Analysis
- Limma linear model for differential gene expression and computation of residual sum of squares for downstream analysis

Variant Analysis (GWAS)

Carries out genome-wide association analysis using parallelized code for speed. Takes genotype and phenotype data as input and returns significance measures obtained by fitting a generalized linear model per SNP.
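
The per-SNP model fitting can be sketched in base R as follows. The genotypes and phenotype are simulated, the 0/1/2 genotype coding, the binary phenotype, and the helper name `snp_pvalue` are assumptions, and the loop runs serially here for simplicity. Each SNP is fitted independently, which is what makes the analysis easy to parallelize.

```r
set.seed(7)
n <- 200
# Simulated genotypes for 5 SNPs, coded as 0/1/2 copies of the minor allele.
geno <- matrix(rbinom(n * 5, size = 2, prob = 0.3), nrow = n,
               dimnames = list(NULL, paste0("snp", 1:5)))
# Simulated binary phenotype that depends only on snp3.
pheno <- rbinom(n, 1, plogis(-1 + 1.2 * geno[, "snp3"]))

# Hypothetical helper: fit a logistic GLM for one SNP and
# return the p-value of the genotype coefficient.
snp_pvalue <- function(g) {
  fit <- glm(pheno ~ g, family = binomial())
  coef(summary(fit))["g", "Pr(>|z|)"]
}

# One independent model per SNP (column).
pvals <- apply(geno, 2, snp_pvalue)
```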



ssarda/genomeutils documentation built on May 30, 2019, 8:42 a.m.