README.md

Omer Acar September 30, 2019

pgsNetwork

The scripts in the R folder and the Rmd files in the vignettes folder are used for data generation, tidying and analysis for -------citation here-------

Installation

To replicate all analyses and reproduce the figures in the paper, I wrote a makefile. Clone this repo, download the raw data from thecellmap.org into the analysis/data/raw_data folder, and run make on the command line. All figures should then be saved into the figures folder in the repo directory.

Dependencies

The makefile installs the R and Python dependencies. You should have R version >3.6 and Python version >3.7. You also need to install the Markov clustering software from https://micans.org/mcl/. The makefile uses mcxload, mcl and mcxdump for Markov clustering. If you are not interested in that, you can comment out the clustering-related lines in the makefile.
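Before running make, it can help to confirm that the three MCL binaries named above are on your PATH. The following is a minimal sketch and not part of the repository; the function name `missing_tools` is hypothetical, and only the tool names come from this README.

```python
import shutil

# Tool names are taken from this README; everything else is illustrative.
MCL_TOOLS = ("mcxload", "mcl", "mcxdump")


def missing_tools(tools=MCL_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing MCL binaries:", ", ".join(missing))
    else:
        print("All MCL binaries found")
```

If any binary is reported missing, either install MCL or comment out the clustering lines in the makefile as described above.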

Raw data

The raw data for this project are the double mutant data taken from thecellmap.org. The files should be downloaded and copied into the analysis/data/raw_data folder. The overlapping orfs and genomic_neighbors dataframes were taken from previous work, for which I don't have a script. The overlapping orfs dataframe contains every ORF name that overlaps another annotated ORF in the yeast genome.

For interaction data analysis, the SGA_NxN, SGA_ExN_NxE and SGA_ExE datasets were used. For Pearson correlation analysis, pcc_all was used.
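A quick way to confirm that the four datasets are in place before running make is sketched below. This is only an illustration: the README does not state file extensions, so files are matched by name prefix, which is an assumption, and `missing_datasets` is a hypothetical helper.

```python
from pathlib import Path

# Dataset names as listed in this README; file extensions are not stated,
# so files are matched by name prefix (an assumption).
REQUIRED_DATASETS = ("SGA_NxN", "SGA_ExN_NxE", "SGA_ExE", "pcc_all")


def missing_datasets(filenames, required=REQUIRED_DATASETS):
    """Return the required dataset names that no file name starts with."""
    names = list(filenames)
    return [
        dataset
        for dataset in required
        if not any(name.startswith(dataset) for name in names)
    ]


raw_dir = Path("analysis/data/raw_data")
if raw_dir.is_dir():
    missing = missing_datasets(path.name for path in raw_dir.iterdir())
    print("Missing raw data:", missing or "none")
```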

All files created by the R/create_* scripts are saved in the analysis/data/derived_data folder. These files are used as inputs for most of the other scripts.

The proto-gene list and figure themes are sourced through the .Rprofile file. Please don't use the Rscript --vanilla option, because it won't source the .Rprofile.

Combined interaction data

I combined the interaction data into a single dataframe and saved it as an rds file for faster reading, using 'R/create_SGA_data_combined.R'.

Allele name -- orf name mappings

The cellmap data provides strain_ids as an excel file, but some of its lines are problematic to read because of an extra ',' character, and I didn't need the single mutant fitnesses. I therefore created the strain_ids dataframe with the R/create_strain_ids.R script, which maps allele names to SGD orf names by combining all unique ID and allele name pairs in the interaction datasets. This script currently removes genes with suppressor mutations, i.e. the ones with 'supp' in their allele names, because these interaction data are not 'real' deletion interactions.
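The logic of that mapping can be sketched as follows. The repository's script is written in R; this Python version is only an illustration. The column names (`query_id`, `query_allele`, `array_id`, `array_allele`), the example identifiers, and the rule that the orf name is the part of the strain ID before the first underscore are all assumptions, not taken from the script.

```python
def build_strain_ids(interaction_rows):
    """Collect unique (orf, allele) pairs from both sides of each interaction.

    Assumptions (hypothetical, not from the repo): rows are dicts with the
    keys used below, and the orf name is the strain ID up to the first '_'.
    """
    pairs = set()
    for row in interaction_rows:
        for id_key, allele_key in (("query_id", "query_allele"),
                                   ("array_id", "array_allele")):
            orf = row[id_key].split("_", 1)[0]
            pairs.add((orf, row[allele_key]))
    # Drop suppressor alleles, i.e. those with 'supp' in the allele name.
    return sorted(pair for pair in pairs if "supp" not in pair[1])


# Made-up rows, for illustration only.
rows = [
    {"query_id": "YAL001C_x1", "query_allele": "abc1-1",
     "array_id": "YBR002W_y1", "array_allele": "def2"},
    {"query_id": "YAL001C_x1", "query_allele": "abc1-1",
     "array_id": "YCL003C_z1", "array_allele": "ghi3-supp1"},
]
strain_ids = build_strain_ids(rows)
```

Using a set makes each ID and allele pair appear once, no matter how many interaction rows mention it.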

Number of experiments and experiment category

After creating the strain_ids dataframe, I used the interaction dataframe to add the number of experiments in which every gene was tested. This is done by the R/create_strain_ids_with_experiment_count.R script, which also adds the experiment category for non-essential genes and saves these data into 2 separate csv files.
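Counting experiments per allele could look like the sketch below. It assumes that every row of the interaction data counts as one experiment for both the query and the array strain, and it reuses hypothetical column names; the actual R script may define an experiment differently. The experiment category is not reproduced here because the README does not define it.

```python
from collections import Counter


def count_experiments(interaction_rows):
    """Count how many interaction rows each allele appears in.

    Assumption: one row equals one experiment for both strains in the row.
    Column names are hypothetical.
    """
    counts = Counter()
    for row in interaction_rows:
        counts[row["query_allele"]] += 1
        counts[row["array_allele"]] += 1
    return counts


# Made-up rows, for illustration only.
example_rows = [
    {"query_allele": "abc1-1", "array_allele": "def2"},
    {"query_allele": "abc1-1", "array_allele": "ghi3"},
]
experiment_counts = count_experiments(example_rows)
```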

Different alleles

The R/create_df_different_alleles.R script creates a dataframe saved as 'df_different_alleles.rds', with orf names in the first column and allele names in the second column. The second column can hold multiple values, which I then used to find multi-allele genes for random allele selection during simulations.
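The idea of grouping alleles by orf and then picking one allele at random for each orf can be sketched as below. This is an illustration in Python, not the repository's R code; the function names and the example identifiers are hypothetical.

```python
import random


def group_alleles_by_orf(pairs):
    """Map each orf to the list of its distinct alleles, in first-seen order."""
    grouped = {}
    for orf, allele in pairs:
        alleles = grouped.setdefault(orf, [])
        if allele not in alleles:
            alleles.append(allele)
    return grouped


def multi_allele_orfs(grouped):
    """Return the orfs that have more than one allele."""
    return [orf for orf, alleles in grouped.items() if len(alleles) > 1]


def pick_one_allele_per_orf(grouped, rng):
    """Choose one allele per orf uniformly at random."""
    return {orf: rng.choice(alleles) for orf, alleles in grouped.items()}


# Made-up pairs, for illustration only.
pairs = [("YAL001C", "abc1-1"), ("YAL001C", "abc1-2"), ("YBR002W", "def2")]
grouped = group_alleles_by_orf(pairs)
picked = pick_one_allele_per_orf(grouped, random.Random(0))
```

Passing a seeded `random.Random` instance keeps a simulation run repeatable, while a fresh seed for every run gives a new random allele choice.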

Interaction density analysis

The R/calculateInteractionDensity.R script calculates interaction densities and produces figure xx.

Distance to neighbors

The R/neighborDistanceAnalysis.R file calculates the distance to neighbors on the PCC network and produces fig xx.

Regulatory markers analysis

R/regulatoryAnalysis.R was used to create the figures on nucleosome-free regions and distance to TSS [fig xx]. This analysis requires data from [ref to ATAC-seq] and [ref to TIF-seq]. The files are used under the same names provided with those papers. The inputs should be in analysis/data/raw_data/.

Interaction Network analysis

The R/interactionNetworkAnalysis.R file reads the files mentioned above and runs simulations using different methods. The file's last cell should create the interaction network graphs in the manuscript (fig xx), the connectivity analysis (fig xx) and fig xx (proportion of proto-genes/non-essential genes with interactions).

Pearson correlation network analysis

R/pccNetworkAnalysis.R creates the same set of plots as the interaction network analysis. However, since I used a different set of PCC calculations, the 2nd chunk, which reads the input files, has more code than in interactionNetworkAnalysis.R. Please see the scripts for details.

Clustering analysis

Link community clustering was run on the PCC network using the github.com/Nathaniel-Rodriguez/linkcom package.

Markov clustering was run on the PCC network using the mcl binaries provided at https://micans.org/mcl/. Script x should give you the same result I had, and R/clusteringAnalysis_combined.R creates the figures by reading the inputs generated by script x.

Synteny heatmap

The workflow for this analysis is as follows:
- I had syntenic blocks of Scer, Spar, Smik, Skud, Seub, Suva, Sarb and Sjur and wrote Python scripts (available at github.com/oacar/SynORFan) to align and analyze them. This step returns a csv file for every ORF.
- After running these scripts on genes, proto-genes and intergenic orfs, I used the R/combineSynalOutputs.R script to read all csv files and build a combined dataframe.
- This dataframe is then used as input for the R/clustering_pythondata.R script, which generates a hierarchical clustering dendrogram with a heatmap.

Since this analysis ran on a cluster for multiple days, it is not included in the makefile. You can still inspect the SynORFan scripts in the GitHub repo and check R/clustering_pythondata.R.

Subnetwork plots

R/subgraph_plots_cytoscape.R was used for figure xx and figure xx. You need RCy3 and Cytoscape installed to reproduce those figures. Since these functions are interactive, they are not included in the makefile.



oacar/pgsNetwork documentation built on Oct. 1, 2019, 9:15 a.m.