Home

/

GitHub

/

README.md
In theandyb/aeffp: Lasso-penalized Logistic Regression via ADMM

aeffp

Andy's 815 Final Project: Using lasso-penalized logistic regression to explore the influence of local nucleotide sequence context on mutation rate heterogeneity.

devtools::install_github("theandyb/aeffp")

My method assumes that the response variable has been coded in a -1/+1 scheme. Given a matrix of predictor variables X and a corresponding vector of observations y, you could run

admmlasso_logC(X, y, lambda)

where lambda is the tuning parameter for the lasso penalty. There is also a raw R implementation avaialable (admmlasso_log), though this was something I used to prototype this and I would not suggest you use this.

If your data is grouped (i.e. you have a count of hits and misses), you could use the function admmlasso_logC_tabled, where instead of a vector of observations you provide a matrix with two columns, the first containing the "hits" (successes) and the second containing the count of "misses" (failures).

Simulation experiments. Verified that my implementation produces the same output as Stephen Boyd's matlab implementation and that my method is a bit slower than glmnet.

Example of using glmnet to fit penalized logistic regression model using motifs as predictors.

Assuming variant calls are available in a vcf/bcf file, run a command similar to this to extract information about the singletons (note: not all the fields here need to be extracte, but following steps might assume these fields are present in the output)

bcftools query -f "%CHROM\t%POS\t%REF\t%ALT\t%DP\t%NS\n" /path/to/vcf >> /desired/output/path.txt

This should result in a file that looks like the following:

10      60524   G       T       34140   3764
10      60606   T       C       10843   3473
10      60765   G       A       29355   3753
10      60776   A       G       33805   3757
10      60828   A       G       47986   3761

Each site in the above output needs to be annotated with the local nucleotide context. This requires a reference genome and a script to pull the sequence from the reference for each singleton (an example can be found in the inst/perl directory -> annotate.pl)
Generate a count of the number of times each motif occurs in the reference genome. A script that does this is available in the inst/python directory (motif_count.py). This script requires a text file that lists the motifs you want to count (one per line); an additional script is included (gen_motif.py) to generate this file for you.

python gen_motif.py 
python motif_count.py -i /refGenome.fasta -m motifs5.txt -o output/path

You also need to match each motif with its reverse-complement and sum the two together.

A very simplified analysis can be found in the jupyter notebook titled

theandyb/aeffp documentation built on May 8, 2019, 9:09 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

theandyb/aeffp
Lasso-penalized Logistic Regression via ADMM

README.md
In theandyb/aeffp: Lasso-penalized Logistic Regression via ADMM

aeffp

Installation

Basic Usage

Examples

testing.Rmd

real_data_application.ipyn

Application: local nucleotide context and mutation rate heterogeneity

Preparing Data

Example Analysis

R Package Documentation

Browse R Packages

We want your feedback!

theandyb/aeffp Lasso-penalized Logistic Regression via ADMM

README.md In theandyb/aeffp: Lasso-penalized Logistic Regression via ADMM

aeffp

Installation

Basic Usage

Examples

testing.Rmd

real_data_application.ipyn

Application: local nucleotide context and mutation rate heterogeneity

Preparing Data

Example Analysis

R Package Documentation

Browse R Packages

We want your feedback!

theandyb/aeffp
Lasso-penalized Logistic Regression via ADMM

README.md
In theandyb/aeffp: Lasso-penalized Logistic Regression via ADMM