Fast Gene Annotation by Alignment or Chromosomal Location

Description

These methods perform fast probe-to-gene annotation using either 1) chromosomal location, or 2) alignment files from software such as "bowtie", "bowtie2" or "gmap".

Usage

annotateByAlignment(file1, file2, alignment.columnsIndex, sepFile2, minScore,
  refDownStream, refUpStream, probesetSep, txDb, mapType = "EXONINTRON",
  promotorRange = 1500, extendedRange = 2000, orgDb, orgDb_Columns,
  sep_intra = ";", sep_inter = "//", verbose = FALSE)

annotateByLocation(x, txDb, mapType, promotorRange = 1500,
  extendedRange = 2000, orgDb, orgDb_Columns, sep_intra = ";",
  sep_inter = "\\", verbose = FALSE)

Arguments

file1

A character vector, the name of the probes' FASTA file. See Input Files.

file2

A character vector, the name of the probes-to-reference SAM file. See Input Files.

alignment.columnsIndex

A numeric vector, containing the column indices of the score, probe name, reference name and alignment offset in the alignment file file2. This argument is not used if alignment.method is declared.

sepFile2

A character vector, the column separator string in file2. See Input Files.

minScore

A numeric value, giving the minimum allowed alignment score.

refDownStream

A numeric value, giving the number of downstream bp in the reference, see Reference Sequence Region.

refUpStream

A numeric value, giving the number of upstream bp in the reference, see Reference Sequence Region.

probesetSep

A character vector, indicating the probeset separator if probes are organised in sets.

txDb

A TranscriptDb object, giving the genomic references for the alignment. If txDb is missing, the default TxDb.Hsapiens.UCSC.hg19.knownGene will be used.

mapType

A character vector representing the probe-to-gene mapping type. This must be one of "EXONINTRON", "NO_EXONINTRON" or "EXON". Any unambiguous substring can be given. See Reference Sequence Region.

promotorRange

An integer vector, giving the window size for the genes' promotor site in bp. Default is 1500, see Location Types.

extendedRange

An integer vector, giving the window size for the genes' extended site in bp. Default is 2000, see Location Types.

orgDb

An OrganismDb object, giving the details of the genomic references. If orgDb is missing, the default org.Hs.eg.db will be used.

orgDb_Columns

A character vector (optional), giving which columns to extract from the orgDb. Note that if orgDb_Columns is used, the user-selected columns are extracted instead of the default request on org.Hs.eg.db, see details section.

sep_intra

A character vector, giving the separator character for gene information, see Output Separators.

sep_inter

A character vector, giving the separator character between genes, see Output Separators.

verbose

A logical value, indicating if messages should be printed. Default is FALSE.

x

A GRanges object or data.frame, giving the coordinates to annotate.

Value

The method annotateByAlignment returns a data.frame object, with one record (or row) for each probe given in x.

With the default organism database, this data.frame contains the following information:

Column Comment
probe_name xxxx
entrezid xxxx
chr xxxx
strand xxxx
loctype xxxx
gene_end xxxx
gene_start xxxx
gene_symbol xxxx
gene_alias xxxx
gene_name xxxx

If the user supplies their own organism database (orgDb and orgDb_Columns), the function will return a data.frame equivalent to the one above, with the columns gene_symbol, gene_alias and gene_name replaced by the ones provided in orgDb_Columns.

Reference Sequence Region

Genomic level:

This corresponds to the gene reference format used for alignment and is controlled with the mapType argument. It determines the loctype values available in the annotation. Three values of mapType are allowed:

  1. "NO_EXONINTRON": if the reference sequence contains one sequence per gene.

    The available loctype are: "gene","promotor","extended","intragenic".

  2. "EXONINTRON": if the reference contains one sequence per transcript, including introns.

    The available loctype are: "gene","intron","exon","promotor","extended","intragenic".

  3. "EXON": if the reference contains one sequence per transcript, without introns.

    The available loctype are: "gene","exon","promotor","extended","intragenic".
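
The mapType argument accepts any unambiguous substring of these three values. As a minimal sketch of what that could look like, assuming the functions resolve mapType with base R's match.arg (the source does not confirm this) and using a hypothetical helper named resolveMapType:

```r
# Hypothetical helper, not part of the package: resolves a mapType
# abbreviation with base R partial matching (assumed behaviour).
resolveMapType <- function(mapType) {
  match.arg(mapType, choices = c("EXONINTRON", "NO_EXONINTRON", "EXON"))
}

resolveMapType("NO")      # unique prefix of "NO_EXONINTRON"
resolveMapType("EXON")    # exact match wins over the longer "EXONINTRON"
resolveMapType("EXONI")   # unique prefix of "EXONINTRON"

# An ambiguous prefix such as "EX" matches two choices and raises an error,
# which is caught here and turned into NA.
ambiguous <- tryCatch(resolveMapType("EX"), error = function(e) NA_character_)
```

Under this assumption, "EXON" always selects the intron-free reference, so the full name "EXONINTRON" (or a prefix of at least five characters) is needed to select the other transcript-level mode.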

Upstream and downstream.

The number of upstream and downstream bp retrieved in the reference is controlled with the refUpStream and refDownStream parameters.

Input Files

Probes' FASTA file:

this file's name is given by the file1 argument. It is used to retrieve all of the platform's probe names.

Probeset:

when the probes in the platform are arranged in probesets, one can use the probesetSep argument to define the probeset separator string.

For example, using Affymetrix's XXX platform, set probesetSep="at.".

Alignment output:

this file's name is given by the file2 argument. The file must contain columns for the alignment score, probe name, reference name and alignment offset (see Alignment format below).

Alignment format:

the alignment format must be known to this function so that it can extract the alignment information (score, probe, reference, offset). The default input is the SAM format (see specifications at https://samtools.github.io/hts-specs/SAMv1.pdf); other formats can be described manually using the alignment.columnsIndex and sepFile2 arguments.

alignment.columnsIndex and sepFile2 allow the user to describe a specific alignment output format. alignment.columnsIndex must indicate the columns holding the score, probe name, reference name and alignment offset. The column separator is given with the sepFile2 argument.
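
To illustrate how a set of column indices and a separator describe an alignment file, here is a minimal base R sketch. It is not the package's parser. The two records are made up, the index order (score, probe, reference, offset) follows the description above, and using the SAM MAPQ field as the score with an inclusive minScore threshold is an assumption made for this example only.

```r
# Two made-up SAM-like records (tab-separated), for illustration only.
samLines <- c(
  "probe_1\t0\tgeneA\t120\t42\t20M\t*\t0\t0\tACGTACGTACGTACGTACGT\t*",
  "probe_2\t16\tgeneB\t57\t3\t20M\t*\t0\t0\tTTGCATTGCATTGCATTGCA\t*"
)

# Assumed order: score, probe name, reference name, alignment offset.
# In SAM these would be MAPQ (5), QNAME (1), RNAME (3) and POS (4).
columnsIndex <- c(5, 1, 3, 4)
sepFile2 <- "\t"

# Split every record on the separator and pick the four fields of interest.
fields <- strsplit(samLines, split = sepFile2, fixed = TRUE)
alignment <- data.frame(
  score  = as.numeric(vapply(fields, `[`, character(1), columnsIndex[1])),
  probe  = vapply(fields, `[`, character(1), columnsIndex[2]),
  ref    = vapply(fields, `[`, character(1), columnsIndex[3]),
  offset = as.integer(vapply(fields, `[`, character(1), columnsIndex[4])),
  stringsAsFactors = FALSE
)

# Keep only alignments at or above a minimum score (assumed inclusive).
minScore <- 10
kept <- alignment[alignment$score >= minScore, ]
print(kept)
```

For a non-SAM aligner output, only columnsIndex and sepFile2 would change in this sketch; the extraction itself stays the same.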

Location Types

The location types (column loctype) are pre-defined regions that describe the part of a gene that the probe matches. There are six location types (shown in the table below).

loctype illustration
"gene" .............########################.............
"intron" ................***.....******..**................
"exon" .............===...=====......==..===.............
"promotor" ..........<+++++>.................................
"extended" .......<~~~~>........................<~~~~>.......
"intragenic" ------>....................................<------

Promotor and extended regions:

they can be adjusted using the promotorRange and extendedRange parameters. The promotor range is set at +/- promotorRange bp from the gene's start location. The extended ranges are located at both ends of the gene, extending the gene region by extendedRange bp.

To exclude the "promotor" and/or "extended" regions for the annotation, set promotorRange=0 and/or extendedRange=0.
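
The arithmetic behind these windows can be sketched in base R. The helper regionWindows below is hypothetical and not part of the package. It assumes a plus-strand gene and inclusive 1-based coordinates; the exact boundaries used by the package may differ.

```r
# Hypothetical helper, not part of the package. Assumes a plus-strand gene
# and inclusive 1-based coordinates (both are assumptions of this sketch).
regionWindows <- function(geneStart, geneEnd,
                          promotorRange = 1500, extendedRange = 2000) {
  list(
    # +/- promotorRange bp around the gene start
    promotor = c(geneStart - promotorRange, geneStart + promotorRange),
    # extendedRange bp added before the gene start and after the gene end
    extended_upstream = c(geneStart - extendedRange, geneStart - 1),
    extended_downstream = c(geneEnd + 1, geneEnd + extendedRange)
  )
}

w <- regionWindows(geneStart = 10000, geneEnd = 15000)
w$promotor             # 8500 11500
w$extended_upstream    # 8000 9999
w$extended_downstream  # 15001 17000
```

With the default values, the promotor window (3000 bp wide) overlaps both the upstream extended window and the first 1500 bp of the gene itself, which matches the overlap suggested by the illustration table.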

Output Separators

sep_intra

controls how a column's elements are concatenated within a single reference (i.e. gene). If sep_intra=";" then gene items that have multiple entries are concatenated with ";". For example:

  • EGFR gene has six other symbols (ERBB, HER1, mENA, ERBB1, PIG61 and NISBD2), the "alias" column will be:

    EGFR;ERBB;HER1;mENA;ERBB1;PIG61;NISBD2

  • HOXA10 gene has four other symbols (HOX1, HOX1.8, HOX1H and PL), the "alias" column will be:

    HOXA10;HOX1;HOX1.8;HOX1H;PL

sep_inter

controls how a column's elements are concatenated when a probe is mapped to multiple references (i.e. genes).

For example, assume that a probe is mapped to both HOXA10 and EGFR. All columns containing gene information are then concatenated with sep_inter="//". Here:

symbol alias

HOXA10//EGFR HOXA10;HOX1;HOX1.8;HOX1H;PL//EGFR;ERBB;HER1;mENA;ERBB1;PIG61;NISBD2
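
The two-level concatenation can be reproduced with base R's paste. This is a sketch of the output format described above, not the package's internal code; the variable names are chosen for illustration.

```r
sep_intra <- ";"
sep_inter <- "//"

# Aliases per gene, taken from the examples above.
aliases <- list(
  HOXA10 = c("HOXA10", "HOX1", "HOX1.8", "HOX1H", "PL"),
  EGFR   = c("EGFR", "ERBB", "HER1", "mENA", "ERBB1", "PIG61", "NISBD2")
)

# First join the entries within each gene, then join the genes.
perGene <- vapply(aliases, paste, character(1), collapse = sep_intra)
symbolColumn <- paste(names(aliases), collapse = sep_inter)
aliasColumn  <- paste(perGene, collapse = sep_inter)

symbolColumn  # "HOXA10//EGFR"
aliasColumn   # "HOXA10;HOX1;HOX1.8;HOX1H;PL//EGFR;ERBB;HER1;mENA;ERBB1;PIG61;NISBD2"
```

Choosing two distinct separators keeps the result reversible: splitting on sep_inter first and on sep_intra second recovers the per-gene entries.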

Examples

## .. todo
## Not run: 
## Example of 3 coordinates of the MGMT gene on chromosome 10
## - coordinates: chr10:27132612-2713562
## - assembly: hg19 (default in annotateByLocation)
library(GenomicRanges)  # provides GRanges() and, via IRanges, IRanges()

start = c(131263500, 131264960, 131265460)
probeID = sprintf("probe_%d",1:3)

## Using x in GRanges format
gr = GRanges(seqnames = "chr10",
             strand = "+",
             ranges = IRanges(start = start,
                              width = 20),
             ID = probeID)
annot_gr = annotateByLocation(x = gr, mapType = "EXON")

## Using x in data.frame format
## (end = start + 19 gives the same 20 bp width as the GRanges above)
df = data.frame(chr = "chr10",
                strand = "+",
                start = start,
                end = start+19,
                ID = probeID)
annot_df = annotateByLocation(x = df, mapType = "EXON")

## Check if both results are the same
all.equal(annot_gr, annot_df)

print(annot_gr)

## End(Not run)