read.seqz: Read a seqz or acgt format file

Description

Efficiently reads a seqz file into R.

Usage

    read.seqz(file, n_lines = NULL, col_types = "ciciidddcddccc", chr_name = NULL,
        buffer = 33554432, parallel = 1,
        col_names = c("chromosome", "position", "base.ref", "depth.normal",
            "depth.tumor", "depth.ratio", "Af", "Bf", "zygosity.normal",
            "GC.percent", "good.reads", "AB.normal", "AB.tumor",
            "tumor.strand"), ...)

Arguments

file

file name

col_types

a string describing the class of each column of the input file (see read_tsv). The default value corresponds to the columns of a seqz file.

chr_name

if specified, only the selected chromosome is extracted instead of the entire file. For tabix-indexed files this argument can also be used to extract coordinate-selected genomic regions, e.g. chr_name = "5:1-1000000" selects the first megabase of chromosome 5.

n_lines

vector of length 2 specifying the first and last line to read from the file. If specified, only the selected portion of the file will be used.

buffer

maximal size of each chunk in bytes (see chunk.apply).

parallel

integer, number of threads used to process a seqz file (see chunk.apply).

col_names

names of the columns of the seqz file. The default corresponds to the column names of a seqz file.

...

any arguments accepted by read_tsv.
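
The default col_types string uses the compact one-letter-per-column notation of read_tsv, in which c stands for character, i for integer and d for double. As a minimal sketch in base R, the default string can be paired with the default col_names to check how each column would be parsed. The helper name describe_col_types is hypothetical and not part of sequenza.

```r
# Hypothetical helper (not part of sequenza): expand a compact col_types
# string into one readable type name per column.
describe_col_types <- function(col_types, col_names) {
  # Split the string into single-letter type codes.
  codes <- strsplit(col_types, "")[[1]]
  # One code is expected per column name.
  stopifnot(length(codes) == length(col_names))
  lookup <- c(c = "character", i = "integer", d = "double")
  setNames(unname(lookup[codes]), col_names)
}

# Default column names, copied from the Usage section.
seqz_names <- c("chromosome", "position", "base.ref", "depth.normal",
                "depth.tumor", "depth.ratio", "Af", "Bf", "zygosity.normal",
                "GC.percent", "good.reads", "AB.normal", "AB.tumor",
                "tumor.strand")

seqz_types <- describe_col_types("ciciidddcddccc", seqz_names)
print(seqz_types[c("chromosome", "position", "depth.ratio")])
```

If a custom col_names vector is supplied, the same check can help confirm that col_types still has one letter per column.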

Format

A seqz file is a tab-separated text file with 14 columns and a header row. The first 3 columns are derived from the original pileup file and contain:

chromosome

the chromosome name

position

the base position

base.ref

the base in the reference genome. Note that this is NOT necessarily the same base as in the normal specimen.

The remaining 11 columns contain the following information:

depth.normal

read depth observed in the normal sample

depth.tumor

read depth observed in the tumor sample

depth.ratio

ratio of depth.tumor to depth.normal

Af

A-allele frequency observed in the tumor sample

Bf

B-allele frequency observed in the tumor sample in heterozygous positions

zygosity.normal

zygosity of the reference sample. "hom" corresponds to AA or BB, whereas "het" corresponds to AB or BA

GC.percent

GC-content (percent), calculated from the reference genome in fixed nucleotide windows

good.reads

number of reads that passed the quality threshold (threshold specified in the pre-processing software), in the tumor specimen

AB.normal

base(s) found in the germline sample; for heterozygous positions AB are sorted using the values of Af and Bf respectively

AB.tumor

base(s) found in the tumor sample that are not present in the normal specimen. The field includes all the variants found in the tumor alignment, separated by a colon. Each variant contains the base and the observed frequency

tumor.strand

frequency of the variant nucleotides detected in the forward orientation. The field has the same structure as AB.tumor and indicates, for each variant, the fraction of the reads presenting that variant that are oriented in the forward direction
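
The colon-separated structure shared by AB.tumor and tumor.strand can be unpacked with a few lines of base R. The sketch below assumes that each variant is written as a single base letter immediately followed by its numeric value (for example C0.25) and that a lone "." marks a position without variants. Both are assumptions about the encoding, which is not spelled out above, and the function name parse_variant_field is hypothetical.

```r
# Hypothetical helper (not part of sequenza): split one AB.tumor or
# tumor.strand field into a data frame with one row per variant.
# Assumed encoding: "<base><value>" entries separated by ":", e.g. "C0.25:T0.1".
parse_variant_field <- function(field) {
  empty <- data.frame(base = character(0), value = numeric(0),
                      stringsAsFactors = FALSE)
  # Assumption: NA, "" or "." means that no variant was reported.
  if (is.na(field) || field %in% c("", ".")) {
    return(empty)
  }
  variants <- strsplit(field, ":", fixed = TRUE)[[1]]
  data.frame(base = substr(variants, 1, 1),
             value = as.numeric(substring(variants, 2)),
             stringsAsFactors = FALSE)
}

# Made-up field values, for illustration only.
ab_tumor <- parse_variant_field("C0.25:T0.1")
tumor_strand <- parse_variant_field("C0.6:T0.5")
print(ab_tumor)
```

Because both fields list the variants in the same structure, the two resulting data frames can be compared row by row.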

Details

read.seqz provides efficient access to a seqz file by chromosome or by line numbers. With tabix-indexed seqz files, the function can also access coordinate-specific regions. The content of a seqz file is described in the Format section.
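
For coordinate-specific access, the region is passed to chr_name as a string of the form "chromosome:start-end", as in the "5:1-1000000" example given for that argument. A minimal sketch of building such a string safely in base R follows. The helper make_region and the file name in the commented call are hypothetical.

```r
# Hypothetical helper (not part of sequenza): build a "chr:start-end"
# region string for the chr_name argument.
make_region <- function(chromosome, start, end) {
  stopifnot(start >= 1, end >= start)
  # format() with scientific = FALSE avoids output such as "1e+06".
  paste0(chromosome, ":",
         format(start, scientific = FALSE, trim = TRUE), "-",
         format(end, scientific = FALSE, trim = TRUE))
}

region <- make_region(5, 1, 1000000)
print(region)

# With a tabix-indexed seqz file (the path is a placeholder), the call
# would look like this:
# seqz_region <- read.seqz("sample.seqz.gz", chr_name = region)
```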

See Also

read_delim.

Examples

    ## Not run:

    data_file <- system.file("extdata", "example.seqz.txt.gz", package = "sequenza")

    ## Read chromosome 1 from a seqz file.
    seqz_data <- read.seqz(data_file, chr_name = 1)

    ## Fast access to chromosome X using the file metrics.
    gc.stats <- gc.sample.stats(data_file)
    chrX <- gc.stats$file.metrics[gc.stats$file.metrics$chr == "X", ]
    seqz.data <- read.seqz(data_file, n_lines = c(chrX$start, chrX$end))

    ## Compare the running time of the two methods.
    system.time(seqz.data <- read.seqz(data_file, n_lines = c(chrX$start, chrX$end)))
    system.time(seqz.data <- read.seqz(data_file, chr_name = "X"))

    ## End(Not run)

sequenza documentation built on May 9, 2019, 5:04 p.m.