annotateBy: Fast Gene Annotation by Alignment or Chromosomal Location
In ProbeAnnotator: Fast Gene Annotation in R

Description Usage Arguments Value Reference Sequence Region Input Files Location Types Output Separators Examples

These methods perform fast probe-to-gene annotation using either 1) chromosomal location, or 2) alignment files from software such as "bowtie", "bowtie2" or "gmap".

annotateByAlignment(file1, file2, alignment.columnsIndex, sepFile2, minScore,
  refDownStream, refUpStream, probesetSep, txDb, mapType = "EXONINTRON",
  promotorRange = 1500, extendedRange = 2000, orgDb, orgDb_Columns,
  sep_intra = ";", sep_inter = "//", verbose = FALSE)

annotateByLocation(x, txDb, mapType, promotorRange = 1500,
  extendedRange = 2000, orgDb, orgDb_Columns, sep_intra = ";",
  sep_inter = "\\", verbose = FALSE)

`file1`	A character vector, the name of the probes' FASTA file. See Input Files.
`file2`	A character vector, the name of the probes-to-reference SAM file. See Input Files.
`alignment.columnsIndex`	A numeric vector, containing the index of the score, probe's name, reference's name and alignment offset in the alignment file `file2`. This argument is not used if `alignment.method` is declared.
`sepFile2`	A character vector, the column separator string in `file2`. See Input Files.
`minScore`	A numeric value, giving the minimum allowed alignment score.
`refDownStream`	A numeric value, giving the number of downstream bp in the reference, see Reference Sequence Region.
`refUpStream`	A numeric value, giving the number of upstream bp in the reference, see Reference Sequence Region.
`probesetSep`	A character vector, indicating the probeset separator if probes are organised in sets.
`txDb`	A `TranscriptDb` object, giving the genomic references for the alignment. If `txDb` is missing, the default `TxDb.Hsapiens.UCSC.hg19.knownGene` will be used.
`mapType`	A character vector representing the probe-to-gene mapping type. This must be one of `"EXONINTRON"`, `"NO_EXONINTRON"` or `"EXON"`. Any unambiguous substring can be given. See Reference Sequence Region.
`promotorRange`	An integer vector, giving the window size for the genes' promotor site in bp. Default is `1500`, see Location Types.
`extendedRange`	An integer vector, giving the window size for the genes' extended site in bp. Default is `2000`, see Location Types.
`orgDb`	An `OrganismDb` object, giving the details of the genomic references. If `orgDb` is missing, the default `org.Hs.eg.db` will be used.
`orgDb_Columns`	A character vector (optional), giving which columns to extract from the `orgDb`. Note that if `orgDb_Columns` is used, then the user selected columns will be selected instead of the default request on `org.Hs.eg.db`, see details section.
`sep_intra`	A character vector, giving the separator character for gene information, see Output Separators.
`sep_inter`	A character vector, giving the separator character between genes, see Output Separators.
`verbose`	A logical value, indicating if messages should be printed. Default is `FALSE`.
`x`	A `GRanges` object or `data.frame`, giving the coordinates to annotate.

The method annotateByAlignment returns a data.frame object, with one record (or row) for each probe given in x.

With the default organism database, this data.frame contains the following information:

Column	Comment
`probe_name`	xxxx
`entrezid`	xxxx
`chr`	xxxx
`strand`	xxxx
`loctype`	xxxx
`gene_end`	xxxx
`gene_start`	xxxx
`gene_symbol`	xxxx
`gene_alias`	xxxx
`gene_name`	xxxx

If users supply their own organism database (orgDb and orgDb_Columns), the function returns an equivalent data.frame, with the columns gene_symbol, gene_alias and gene_name replaced by those requested in orgDb_Columns.

Genomic level:

it corresponds to the gene reference format used for the alignment and is controlled with the mapType argument. It determines which loctype values can appear in the annotation. Three mapType values are allowed:

"NO_EXONINTRON": if the reference contains one sequence per gene.

The available loctype are: "gene","promotor","extended","intragenic".

"EXONINTRON": if the reference contains one sequence per transcript, including introns.

The available loctype are: "gene","intron","exon","promotor","extended","intragenic".

"EXON": if the reference contains one sequence per transcript, without introns.

The available loctype are: "gene","exon","promotor","extended","intragenic".
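The mapping between mapType and the available loctype values can be sketched in base R as follows. This is only an illustration, not the package's internal code: the helper name `available_loctypes` is hypothetical, and it assumes that "any unambiguous substring" is resolved the way base R's `match.arg` resolves partial matches.

```r
## Hypothetical helper (not part of ProbeAnnotator): returns the loctype
## values listed above for a given mapType.
available_loctypes <- function(mapType = c("EXONINTRON", "NO_EXONINTRON", "EXON")) {
  ## match.arg() accepts any unambiguous prefix of the allowed values
  ## and stops with an error otherwise (assumed behaviour for mapType).
  mapType <- match.arg(mapType)
  common <- c("promotor", "extended", "intragenic")
  switch(mapType,
         NO_EXONINTRON = c("gene", common),
         EXONINTRON    = c("gene", "intron", "exon", common),
         EXON          = c("gene", "exon", common))
}

print(available_loctypes("EXONI"))  # resolved to "EXONINTRON"
print(available_loctypes("EXON"))   # exact match takes priority
print(available_loctypes("NO"))     # resolved to "NO_EXONINTRON"
```

Under this assumption, a prefix such as "EX" is rejected because it matches both "EXON" and "EXONINTRON".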

Upstream and downstream.

The upstream and downstream lengths retrieved in the reference are controlled with the refUpStream and refDownStream parameters (in bp).

Probes' FASTA file: this file's name is given by the file1 argument. It is used to retrieve all of the platform's probe names.

Probeset:

when the probes in the platform are arranged in probesets, the probesetSep argument can be used to define the probeset separator string.

For example, using Affymetrix's XXX platform, set probesetSep="at.".

Alignment output:

this file's name is given by the file2 argument. The file must contain columns for the alignment score, the probe's name, the reference's name and the alignment offset (see Alignment format below).

Alignment format:

the alignment format must be known to this function in order to extract the alignment information (score, probe, reference, offset). The default input is the SAM format (see specifications at https://samtools.github.io/hts-specs/SAMv1.pdf); other formats can be described manually using the alignment.columnsIndex and sepFile2 arguments.

alignment.columnsIndex and sepFile2 allow the user to describe a specific alignment output format. alignment.columnsIndex must give the column indices of the score, the probe's name, the reference's name and the alignment offset. The column separator is given with the sepFile2 argument.
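As an illustration of how a column index vector and a separator describe an alignment file, here is a minimal base R sketch. It does not reproduce the package's parser. The toy SAM records, the use of the MAPQ field (column 5) as the score, and the inclusive comparison against minScore are all assumptions made for the example.

```r
## Toy SAM content (hypothetical data): one header line and two records.
sam_lines <- c(
  "@HD\tVN:1.0\tSO:unsorted",
  "probe_1\t0\tgeneA\t120\t42\t20M\t*\t0\t0\tACGTACGTACGTACGTACGT\t*",
  "probe_2\t0\tgeneB\t15\t3\t20M\t*\t0\t0\tTTGACCGTAGGCTAACGTCA\t*"
)

## Order documented for alignment.columnsIndex: score, probe, reference, offset.
## Assumption: in SAM, score = MAPQ (5), probe = QNAME (1),
## reference = RNAME (3), offset = POS (4).
columnsIndex <- c(score = 5, probe = 1, ref = 3, offset = 4)
sepFile2 <- "\t"
minScore <- 10

## Drop SAM header lines, then split each record on the separator.
records <- sam_lines[!startsWith(sam_lines, "@")]
fields <- strsplit(records, sepFile2, fixed = TRUE)
pick <- function(i) vapply(fields, function(f) f[i], character(1))

aln <- data.frame(
  score  = as.numeric(pick(columnsIndex[["score"]])),
  probe  = pick(columnsIndex[["probe"]]),
  ref    = pick(columnsIndex[["ref"]]),
  offset = as.integer(pick(columnsIndex[["offset"]])),
  stringsAsFactors = FALSE
)

## Keep alignments whose score reaches minScore (assumed to be inclusive).
kept <- aln[aln$score >= minScore, ]
print(kept)
```

For a non-SAM output, only `columnsIndex` and `sepFile2` would need to change in this sketch.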

The location types (column loctype) are pre-defined regions that describe the part of the gene to which the probe matches. There are six loctype values (shown in the table below).

`loctype`	illustration
`"gene"`	`.............########################.............`
`"intron"`	`................*.....**..................`
`"exon"`	`.............===...=====......==..===.............`
`"promotor"`	`..........<+++++>.................................`
`"extended"`	`.......<~~~~>........................<~~~~>.......`
`"intragenic"`	`------>....................................<------`

Promotor and extended regions:

they can be adjusted using the promotorRange and extendedRange parameters. The promotor region spans +/- promotorRange bp around the gene's start location. The extended regions lie at both ends of the gene, extending the gene region by extendedRange bp on each side.

To exclude the "promotor" and/or "extended" regions for the annotation, set promotorRange=0 and/or extendedRange=0.
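The window arithmetic described above can be sketched as follows for a gene on the "+" strand. The helper `region_windows` is hypothetical, and the exact boundaries (inclusive coordinates, extended windows starting right outside the gene) are assumptions, since the package's handling of boundaries and of overlapping regions is not specified here.

```r
## Hypothetical helper (not part of ProbeAnnotator): computes the promotor
## and extended windows of a "+" strand gene as c(start, end) pairs.
## A range of 0 disables the corresponding region (returned as NULL).
region_windows <- function(gene_start, gene_end,
                           promotorRange = 1500, extendedRange = 2000) {
  promotor <- if (promotorRange > 0) {
    c(gene_start - promotorRange, gene_start + promotorRange)
  } else NULL
  extended_up <- if (extendedRange > 0) {
    c(gene_start - extendedRange, gene_start - 1)
  } else NULL
  extended_down <- if (extendedRange > 0) {
    c(gene_end + 1, gene_end + extendedRange)
  } else NULL
  list(promotor = promotor,
       extended_up = extended_up,
       extended_down = extended_down)
}

## Made-up gene spanning positions 10000 to 20000, default ranges.
w <- region_windows(10000, 20000)
print(w$promotor)       # 8500 11500
print(w$extended_up)    # 8000 9999
print(w$extended_down)  # 20001 22000
```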

sep_intra

controls how multiple entries are concatenated within a single reference (i.e. one gene). If sep_intra=";" then gene items that have multiple entries are concatenated with ";". For example:

The EGFR gene has six other symbols (ERBB, HER1, mENA, ERBB1, PIG61 and NISBD2), so the "alias" column will be:

EGFR;ERBB;HER1;mENA;ERBB1;PIG61;NISBD2

The HOXA10 gene has four other symbols (HOX1, HOX1.8, HOX1H and PL), so the "alias" column will be:

HOXA10;HOX1;HOX1.8;HOX1H;PL

sep_inter

controls how column entries are concatenated when a probe is mapped to multiple references (i.e. several genes).

For example, assume that a probe is mapped to both HOXA10 and EGFR. All columns containing gene information are then concatenated with sep_inter="//", giving:

symbol	alias
HOXA10//EGFR	HOXA10;HOX1;HOX1.8;HOX1H;PL//EGFR;ERBB;HER1;mENA;ERBB1;PIG61;NISBD2
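The two-level concatenation can be reproduced in base R with `paste`, and reversed with `strsplit`. This is a sketch of the output convention only, using the alias lists from the examples above; it is not the package's implementation.

```r
sep_intra <- ";"
sep_inter <- "//"

## Aliases of the two genes a probe is assumed to map to.
aliases <- list(
  HOXA10 = c("HOXA10", "HOX1", "HOX1.8", "HOX1H", "PL"),
  EGFR   = c("EGFR", "ERBB", "HER1", "mENA", "ERBB1", "PIG61", "NISBD2")
)

## Level 1: join the entries of each gene with sep_intra.
per_gene <- vapply(aliases, paste, character(1), collapse = sep_intra)

## Level 2: join the genes with sep_inter.
symbol_col <- paste(names(aliases), collapse = sep_inter)
alias_col  <- paste(per_gene, collapse = sep_inter)
print(symbol_col)
print(alias_col)

## Reverse operation: split on sep_inter first, then on sep_intra.
genes_back <- strsplit(alias_col, sep_inter, fixed = TRUE)[[1]]
alias_back <- strsplit(genes_back, sep_intra, fixed = TRUE)
print(alias_back)
```

Splitting back only works when neither separator occurs inside a gene symbol or alias, which is worth checking when choosing custom separators.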

## Not run: 
## Example of 3 coordinates of the MGMT gene on chromosome 10
## - coordinates: chr10:27132612-2713562
## - assembly: hg19 (default in annotateByLocation)
library(GenomicRanges)   ## provides GRanges() and IRanges()
library(ProbeAnnotator)  ## provides annotateByLocation()
start = c(131263500, 131264960, 131265460)
probeID = sprintf("probe_%d",1:3)

## Using x in GRanges format
gr = GRanges(seqnames = "chr10",
             strand = "+",
             ranges = IRanges(start = start,
                              width = 20),
             ID = probeID)
annot_gr = annotateByLocation(x = gr, mapType = "EXON")

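## Using x in data.frame format
## (end = start + 19 gives 20 bp ranges, matching width = 20 above)
df = data.frame(chr = "chr10",
                strand = "+",
                start = start,
                end = start + 19,
                ID = probeID)
annot_df = annotateByLocation(x = df, mapType = "EXON")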

## Check if both results are the same
all.equal(annot_gr, annot_df)

print(annot_gr)

## End(Not run)