prepare_anno: Download and clean ref and prepare anno

View source: R/prepare_anno.R

prepare_anno R Documentation

Download and clean ref and prepare anno

Description

Downloads the reference fasta file for a specific release of Ensembl or Gencode, then cleans it: only the transcript id is kept and, by default, the transcript version is removed. ERCC92 sequences can optionally be added. Reference files without the alternative chromosomes and with only the protein-coding genes are also generated.

Usage

prepare_anno(
  org,
  db = "Ensembl",
  release = NA,
  ERCC92 = FALSE,
  force_download = FALSE,
  gtf = FALSE,
  outdir = "."
)

Arguments

org

The organism name. Currently accepted:

* Homo sapiens (Ensembl and Gencode)
* Mus musculus (Ensembl and Gencode)
* Macaca mulatta (Ensembl only)
* Rattus norvegicus (Ensembl only)
* Bos taurus (Ensembl only)

db

The database to use: Ensembl or Gencode. Default: "Ensembl"

release

The version of the database to use. Must be greater than 100 for Ensembl, 35 for Gencode Homo sapiens and 25 for Gencode Mus musculus. Default: NA

ERCC92

Add ERCC92 sequences to the reference and to the annotation? Default: FALSE

force_download

Re-download raw reference if it is already present? Default: FALSE

gtf

Download the annotation corresponding to the fasta in gtf format? Default: FALSE

outdir

Directory in which to save the files. Default: "."
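
The constraints on org, db and release listed above can be checked before any download starts. The following is only a sketch: check_anno_args and min_release are hypothetical names that are not part of the anno package, and "greater than" is assumed to mean a strict inequality.

```r
# Hypothetical helper, not part of the anno package: checks the documented
# org / db / release constraints before calling prepare_anno().
min_release <- list(
  Ensembl = c(
    "Homo sapiens" = 100, "Mus musculus" = 100, "Macaca mulatta" = 100,
    "Rattus norvegicus" = 100, "Bos taurus" = 100
  ),
  Gencode = c("Homo sapiens" = 35, "Mus musculus" = 25)
)

check_anno_args <- function(org, db = "Ensembl", release = NA) {
  if (!db %in% names(min_release)) {
    stop("db must be one of: ", paste(names(min_release), collapse = ", "))
  }
  supported <- min_release[[db]]
  if (!org %in% names(supported)) {
    stop(org, " is not available for ", db)
  }
  # Assumption: "greater than" is read as a strict inequality.
  if (!is.na(release) && release <= supported[[org]]) {
    stop("release must be greater than ", supported[[org]],
         " for ", db, " ", org)
  }
  invisible(TRUE)
}

# Passes silently for a supported combination.
check_anno_args("Homo sapiens", db = "Ensembl", release = 103)
```

A call such as check_anno_args("Bos taurus", db = "Gencode") stops with an error, since Bos taurus is listed as Ensembl only.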

Details

After calling this function, a <prefix>.raw_ref.fa.gz file corresponding to the raw reference is downloaded (if not already present) to outdir, which is the current working directory by default. A clean version without alternative chromosomes is written as <prefix>.no_alt_chr.fa.gz. A <prefix>.protein_coding.fa.gz file is also generated, containing only the protein_coding genes. For each of the 3 fa.gz files, a <prefix>.csv file is created; it contains the annotation formatted correctly for the rnaseq packages. Finally, a <prefix>.info file is created. This file contains metadata about every file and the parameters used.
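
To make the naming scheme concrete, the sketch below builds the fa.gz and info file names described above for a given prefix and reports which ones already exist. expected_anno_files is a hypothetical helper, not a function of the anno package, and the prefix "Hs.Ensembl103" is used for illustration only. The csv files are left out because their exact names are not spelled out here.

```r
# Hypothetical helper: builds the fasta and info file names for a prefix.
expected_anno_files <- function(prefix, outdir = ".") {
  suffixes <- c(
    ".raw_ref.fa.gz",        # raw reference as downloaded
    ".no_alt_chr.fa.gz",     # cleaned, without alternative chromosomes
    ".protein_coding.fa.gz", # protein_coding genes only
    ".info"                  # metadata about the files and parameters
  )
  file.path(outdir, paste0(prefix, suffixes))
}

files <- expected_anno_files("Hs.Ensembl103")

# Report which of the expected files already exist on disk.
data.frame(file = files, exists = file.exists(files))
```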

The <prefix>.info file contains the following columns:

* prefix: The prefix of the file. Must match the filename (e.g., the prefix of Hs.Gencode38.csv is Hs.Gencode38).
* org: The organism name (e.g., Homo sapiens).
* db: Database from which the annotation was downloaded.
* release: The version of the database.
* ERCC92: The value of the ERCC92 argument.
* anno_pkg_version: The anno package version.
* download_date: The date the annotation was downloaded.
* download_url: The URL that was used to download the annotation.
* An md5sum for every file generated, one column per file.
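
The md5sum columns make it possible to check later that a generated file has not changed. The sketch below shows one way to do this with tools::md5sum from base R. verify_md5 is a hypothetical helper, the temporary file and its made-up transcript id stand in for a real reference, and how the recorded checksum is read from the <prefix>.info file is left open because the file's exact layout and column names are not described here.

```r
# Hypothetical helper: compare a file's current md5sum with a recorded value.
verify_md5 <- function(path, recorded_md5) {
  identical(unname(tools::md5sum(path)), recorded_md5)
}

# Self-contained demonstration: a temporary file stands in for a generated
# reference; the real recorded value would come from the <prefix>.info file.
tmp <- tempfile(fileext = ".fa")
writeLines(c(">ENST00000000001", "ACGT"), tmp)
recorded <- unname(tools::md5sum(tmp))

# Check the unchanged file against its recorded checksum.
unchanged_ok <- verify_md5(tmp, recorded)

# Modify the file, then check again against the old checksum.
writeLines(c(">ENST00000000001", "ACGA"), tmp)
modified_ok <- verify_md5(tmp, recorded)
```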

Value

Returns a list containing all the information in the <prefix>.info file.

Examples

## Not run: 
  # Arguments follow the signature shown in Usage (org, db, release).
  prepare_anno(org = "Homo sapiens", db = "Ensembl", release = 103)

## End(Not run)


CharlesJB/anno documentation built on Feb. 1, 2023, 6:31 a.m.