scanTabix: Operations on 'tabix' (indexed, tab-delimited) files.
In Rsamtools: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import

Description Usage Arguments Value Error Author(s) References Examples

Scan compressed, sorted, tabix-indexed, tab-delimited files.

scanTabix(file, ..., param)
## S4 method for signature 'character,IntegerRangesList'
scanTabix(file, ..., param)
## S4 method for signature 'character,GRanges'
scanTabix(file, ..., param)

`file`	The character() file name(s) of the tabix file be processed, or more flexibly an instance of class `TabixFile`.
`param`	A instance of `GRanges` or `IntegerRangesList` providing the sequence names and regions to be parsed.
`...`	Additional arguments, currently ignored.

scanTabix returns a list, with one element per region. Each element of the list is a character vector representing records in the region.

scanTabix signals errors using signalCondition. The following errors are signaled:

scanTabix_param: yieldSize(file) must be NA when more than one range is specified.
scanTabix_io: A read error occured while inputing the tabix file. This might be because the file is corrupt, or of incorrect format (e.g., when path points to a plain text file but index is present, implying that path should be a bgziped file. The error message may include an error code representing the logical OR of these cryptic signals: 1, BGZF_ERR_ZLIB; 2, BGZF_ERR_HEADER; 4, BGZF_ERR_IO; 8, BGZF_ERR_MISUSE.

Martin Morgan <mtmorgan@fhcrc.org>.

http://samtools.sourceforge.net/tabix.shtml

1	example(TabixFile)

Loading required package: GenomeInfoDb
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colMeans, colSums, colnames, do.call,
    duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
    lapply, lengths, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, rank, rbind, rowMeans, rowSums, rownames, sapply,
    setdiff, sort, table, tapply, union, unique, unsplit, which,
    which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following object is masked from 'package:base':

    expand.grid

Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: Biostrings
Loading required package: XVector

Attaching package: 'Biostrings'

The following object is masked from 'package:base':

    strsplit


TabxFl> fl <- system.file("extdata", "example.gtf.gz", package="Rsamtools",
TabxFl+                   mustWork=TRUE)

TabxFl> tbx <- TabixFile(fl)

TabxFl> param <- GRanges(c("chr1", "chr2"), IRanges(c(1, 1), width=100000))

TabxFl> countTabix(tbx)
$<NA>
[1] 237


TabxFl> countTabix(tbx, param=param)
$`chr1:1-100000`
[1] 157

$`chr2:1-100000`
[1] 15


TabxFl> res <- scanTabix(tbx, param=param)

TabxFl> sapply(res, length)
chr1:1-100000 chr2:1-100000 
          157            15 

TabxFl> res[["chr1:1-100000"]][1:2]
[1] "chr1\tENSEMBL\tUTR\t1737\t2090\t.\t+\t.\tgene_id \"ENSG00000223972\"; transcript_id \"ENST00000456328\"; gene_type \"protein_coding\"; gene_status \"KNOWN\"; gene_name \"RP11-34P13.1\"; transcript_type \"protein_coding\"; transcript_status \"KNOWN\"; transcript_name \"RP11-34P13.1-201\"; level 3; havana_gene \"OTTHUMG00000000961\";" 
[2] "chr1\tENSEMBL\texon\t1737\t2090\t.\t+\t.\tgene_id \"ENSG00000223972\"; transcript_id \"ENST00000456328\"; gene_type \"protein_coding\"; gene_status \"KNOWN\"; gene_name \"RP11-34P13.1\"; transcript_type \"protein_coding\"; transcript_status \"KNOWN\"; transcript_name \"RP11-34P13.1-201\"; level 3; havana_gene \"OTTHUMG00000000961\";"

TabxFl> ## parse to list of data.frame's
TabxFl> dff <- Map(function(elt) {
TabxFl+     read.csv(textConnection(elt), sep="\t", header=FALSE)
TabxFl+ }, res)

TabxFl> dff[["chr1:1-100000"]][1:5,1:8]
    V1      V2         V3   V4   V5 V6 V7 V8
1 chr1 ENSEMBL        UTR 1737 2090  .  +  .
2 chr1 ENSEMBL       exon 1737 2090  .  +  .
3 chr1 ENSEMBL transcript 1737 4275  .  +  .
4 chr1  HAVANA       gene 1737 4275  .  +  .
5 chr1  HAVANA       exon 1873 1920  .  +  .

TabxFl> ## parse 100 records at a time
TabxFl> length(scanTabix(tbx)[[1]]) # total number of records
[1] 237

TabxFl> tbx <- open(TabixFile(fl, yieldSize=100))

TabxFl> while(length(res <- scanTabix(tbx)[[1]]))
TabxFl+    cat("records read:", length(res), "\n")
records read: 100 
records read: 100 
records read: 37 

TabxFl> close(tbx)