README.md

Post processes output from the star salmon pipeline.

Depricated. Renamed and moved to PostRNASeqAlign

Example

post_process_salmon(
  this_script_path = housekeeping::get_script_dir_path(include_file_name = T),
  input_file_paths = <input_file_paths>,
  output_dir = <output_dir>,
  ref = "grch38",
  gene_biotypes = c('protein_coding', 
                    'IG_C_gene','IG_D_gene', 'IG_J_gene', 'IG_V_gene',
                    'TR_C_gene', 'TR_D_gene', 'TR_J_gene','TR_V_gene'),
  thread_num = 8,
  output_transcript_matrix = F,
  output_hgnc_matrix = T,
  output_entrez_id_matrix = F,
  output_piped_hugo_entrez_id_matrix = F,
  output_upper_quartile_norm = F,
  output_log2_upper_quartile_norm = F,
  counts_or_tpm = "counts"
)

Assembling this package

In R:

housekeeping::assemble_package(package_name = "StarSalmon", my_version = "0.2-02",
  my_dir = "/datastore/alldata/shiny-server/rstudio-common/dbortone/packages/StarSalmon")

Push changes

In bash:

cd /datastore/alldata/shiny-server/rstudio-common/dbortone/packages/StarSalmon
my_comment="Forgot to rebuild 0.2-01"
git add .
git commit -am "$my_comment"; git push origin master
git tag -a 0.2-02 -m "$my_comment"; git push -u origin --tags

Install

Restart R In R (local library, packrat library):

devtools::install_github("Benjamin-Vincent-Lab/StarSalmon")

Or for a specific version:

devtools::install_github("Benjamin-Vincent-Lab/StarSalmon", ref = "0.2-02")

Previous locations

https://sc.unc.edu/dbortone/starsalmon https://sc.unc.edu/benjamin-vincent-lab/starsalmon Moved to github so that the package could be accessed without a token.

Creating BM_results

The following code was used to make the grch38 bm_results

mart = useMart(biomart="ENSEMBL_MART_ENSEMBL",
                  dataset="hsapiens_gene_ensembl", 
                  host="useast.ensembl.org") # uk, useast, uswest, asia was intermitant, www was hardley ever working
#https://useast.ensembl.org/info/website/archives/index.html
# other hosts failed to find the data !@$!#$
# failed submisions
# failures mid submission

# datasets <- biomaRt::listDatasets(ensembl)
unique_names = my_dt[[1]]
# my_filters = listFilters(mart)
# my_attr = listAttributes(mart)
# ucsc  |  UCSC Stable ID(s) [e.g. ENST00000000233.9] # these aren't ucsc!?!
# had to run this a ton to get it to go all the way through.  was in the process of breaking it up into a loop when it ran through so I saved it.
failed = FALSE
BM_results = tryCatch({
  biomaRt::getBM(
    filters= "ucsc",
    attributes= c("ucsc", "hgnc_symbol", "entrezgene_id", "gene_biotype"),
    values= unique_names,
    mart= mart
  )
}, warning = function(w) {
  message("Got warning")
  failed = TRUE
}, error = function(e) {
  message("Got error")
  failed = TRUE
})

I renamed the 'entrezgene_id' to 'entrezgene.' Connecting with the dataabse above was very problematic. It failed to connect 1 out of 5 times and when it did connnect it didn't finish. I was giving up on it and was going to write a loop to keep sending smaller batches using the try catch statement when finally the whole thing went through. For future uses a loop is the way to go. Also don't expect the column names to stay stable. They change these on almost a monthly basis. I'd love to switch to something other than biomaRt, but unfortunately AFAIK there isn't anything else.

Comments

Using test_code.R, I checked if hgnc or entrez was better for not having one ensemble map to them. Almost all of the duplicates for the ensembl id were from one ensembl id mapping to multiple entrez. Very few of the hgnc caused multimappings. I also checked if using entrez to lookup hgnc and visa versa caused more multimappings. It wasn't a huge contributor and looking up the genes did find a lot fo new genes: ~70 for hgnc and ~300 for entrez. I skiped using tximport. I don't see the value of this package, since I'd have to make the tx2gene matrix anyway. That's the hard part.

See inst/mapping_genes.R for the addition of another 1882 genes using AnnotationDbi on 20200805.

ToDos



Benjamin-Vincent-Lab/StarSalmon documentation built on Feb. 15, 2024, 11:34 p.m.