buildindex: Build index for a reference genome




An index needs to be built before read mapping can be performed. This function creates a hash table index for the reference genome, which can then be used by Subread and Subjunc aligners for read alignment.





character string giving the basename of created index files.


character string giving the name of a FASTA-format file that includes sequences of all chromosomes and contigs.


logical indicating whether a gapped index or a full index will be built. A gapped index contains 16mers (subreads) that are extracted every three bases from a reference genome, whereas a full index contains subreads extracted from every chromosomal location of a genome. The index contains a hash table, which includes sequences of subreads and their corresponding chromosomal locations. The default value is FALSE (i.e. a full index is built).


logical indicating whether the index can be split into multiple blocks. The block size is determined by the value of the memory parameter. FALSE by default (i.e. a single-block index is generated).


numeric value specifying the amount of memory (in megabytes) used for storing the index during read mapping. 8000 MB by default. Note that this option is ignored when indexSplit is FALSE.


numeric value specifying the threshold for removing highly repetitive subreads (16mers). 100 by default. A subread is excluded from the index if it occurs more than this number of times in the genome.


logical. If TRUE, a color space index will be built. Otherwise, a base space index will be built.


This function generates a hash-table index for a reference genome, in which keys are subreads (16mers) and values are their chromosomal locations in the reference genome. The built index can then be used by the Subread (align) and Subjunc aligners to map reads (Liao et al. 2013).
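The key/value layout described above can be pictured with a small base-R sketch. This is a toy illustration, not Rsubread's actual implementation: `build_toy_index()` is a hypothetical helper, the genome string is made up, and a 4-mer is used instead of a 16mer so the output stays readable. The `step` argument mimics the difference between a full index (every position) and a gapped index (every third base).

```r
# Toy illustration only: not how Rsubread stores its index internally.
# build_toy_index() is a hypothetical helper written for this sketch.
build_toy_index <- function(genome, k = 16, step = 1) {
  # 1-based start positions of the k-mers to extract
  starts <- seq(1, nchar(genome) - k + 1, by = step)
  kmers <- substring(genome, starts, starts + k - 1)
  # Group start positions by k-mer: names are keys, values are locations
  split(starts, kmers)
}

genome <- "ACGTACGTTTGACGTA"                        # made-up sequence
full   <- build_toy_index(genome, k = 4, step = 1)  # every position
gapped <- build_toy_index(genome, k = 4, step = 3)  # every third base

full[["ACGT"]]    # all locations of "ACGT" in the full index
gapped[["ACGT"]]  # the gapped index records fewer locations
```

In this toy case the gapped index holds roughly a third of the entries, which is the trade-off between index size and the number of locations available for lookup.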

Highly repetitive subreads (or uninformative subreads) are excluded from the hash table so as to reduce mapping ambiguity. TH_subread specifies the maximum number of times a subread may occur in the reference genome and still be included in the hash table.
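The filtering rule can be sketched in a few lines of base R. This is only an illustration under assumed data: `drop_repetitive()` and `toy_index` are hypothetical names, and a threshold of 2 stands in for the default of 100 so the effect is visible on a tiny example.

```r
# Toy illustration only; drop_repetitive() is a hypothetical helper.
drop_repetitive <- function(index, threshold = 100) {
  # Keep a key only if it occurs at most `threshold` times
  index[lengths(index) <= threshold]
}

# Made-up index: names are subreads, values are their locations
toy_index <- list(ACGT = c(1, 5, 12), CGTA = c(2, 13), TTTG = 8)

kept <- drop_repetitive(toy_index, threshold = 2)
names(kept)  # "ACGT" occurs three times, so it is dropped
```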

When an index is split into multiple blocks, only one block is loaded into memory at any time during read mapping, so the amount of memory used during mapping can be controlled. The more memory is used in mapping, the faster the mapping. Generating a one-block full index (e.g. setting both gappedIndex and indexSplit to FALSE) enables the maximum mapping speed.
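As a rough sketch of how these two configurations might be requested, the argument sets below contrast a one-block full index with a gapped, split index capped at a smaller memory budget. The argument names `basename` and `reference` follow the Rsubread interface as understood here, and the file name `genome.fa`, the basenames and the 4000 MB figure are placeholders chosen for illustration. The build is only attempted when the package and the FASTA file are actually present.

```r
# Argument sets for the two configurations described above.
# File names, basenames and the memory value are placeholders.
fast_args <- list(basename = "genome_full", reference = "genome.fa",
                  gappedIndex = FALSE, indexSplit = FALSE)
small_args <- list(basename = "genome_gapped", reference = "genome.fa",
                   gappedIndex = TRUE, indexSplit = TRUE, memory = 4000)

# Only attempt the build when Rsubread is installed and the FASTA file exists
if (requireNamespace("Rsubread", quietly = TRUE) && file.exists("genome.fa")) {
  do.call(Rsubread::buildindex, small_args)
}
```

Since memory is ignored unless indexSplit is TRUE, it is only set in the second argument list.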

Once an index is built, it can be re-used in each mapping.

Sequences of reference genomes can be downloaded from public databases. For instance, the primary assembly of the human genome GRCh38/hg38 or the mouse genome GRCm38/mm10 can be downloaded from the GENCODE database.


No value is produced but index files are written to the current working directory.


Wei Shi and Yang Liao


Yang Liao, Gordon K Smyth and Wei Shi. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research, 41(10):e108, 2013.


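# Build an index for the artificial sequence included in file 'reference.fa'
library(Rsubread)
ref <- system.file("extdata", "reference.fa", package = "Rsubread")
buildindex(basename = "./reference_index", reference = ref)  # basename is an illustrative choice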

Rsubread documentation built on Nov. 9, 2018, 6 p.m.