Notes on identifiers

Chromosome names

: Chromosomes are named with a prefixed "chr" and then the chromosome number or letter. Chromosome with number names (1..22) are specified as 1 or 2 digits with no leading 0. Chromosomes with letter names are specified with a capitalized letter (M, X, or Y). The exception is the mitichondiral chromosome, which may be labeled chrM_rCRS instead of chrM due to issues with the promulgated genome. There should never be mixed names for the mitochondrial chromosome in the same file; it should be one or the other. Case matters, CHRx, chrx, and ChrX are invalid names.

    chr1, chr2, chr10, chr22, chrM/chrM_rCRS, chrX, chrY

Where sorting matters, the order is as given above. chr10 comes just after
chr9, and chrM comes just after chr22. Note that this requires special
handling to sort correctly. Neither normal alphabetic nor numeric sorting
will work.

Feature locations

: Features are specified by a location on the genome. This location is usually given as a range of positions on one of the two strands of a chromosome. The positions are specified as if the feature was on the '+' (5' to 3') strand, where the first base of the chromosome starts at 1 and the given positions represent the first and last base of the feature (inclusively). Features on the '-' (3' to 5') strand run in the reverse direction, so the given first position is really the last base of the feature and the given last position is actually the first base of the feature.

Strands may be specified as +, -, \*, or left unspecified. A \* means the 
strand a unknown. If no strand is specified, a feature is assumed to be
either on the + strand or unknwon, as required by the context. If a sequence
of bases is explicitly given, this is usually the sequence on the + strand.
For features on the - strand, the sequence of that feature is the reverse 
complement of the specified sequence, unless explicitly stated otherwise.

Fusion locations

: Fusion locations are specified as if both ends occured on the plus strand, with the first position being the last base of the low end of the fusion and the second position the first base of the high end of the fusion. If fusion ends are on different chromsosomes, the low end of the fusion is the lower sorting chromosme, as described under chromosome names, above. Strandedness is currently ignored by this package. [TODO: Allow stranded data.]

Fusion junctions provide a weak kind of strandedness information, even when
no strandedness was present in the sequencing. Reads spanning the junction
will either map in the same direction on both sides of the fusion (either
+/+ or -/-), or they will map in different directions on both sides
(either +/- or -/+). Junctions are actuality in one of these 4 possible
states, but only 2 states, same or different, can be determinined in the
absence of strandedness information. Strandedness is represented as +/+ or
+/- respectively. This pseudo-strandedness is currently ignored by this
package. [TODO: Allow for pseudo-stranded data.]

Fusion junctions in RNASeq data usually occur at exon boundaries as a result
of fusion points occuring in introns or creating new splicing patterns.
This means that fusions have a directionality given the assumed role of the
bases at the fusion ends in splicing, with a "donor" and an "acceptor" side.
For fusion points within an "exon", this directionality is not meaningful
as any base may be on either side of the fusion point in the processed
transcript, but it is not possible to tell from the RNASeq data alone when
this has occured. This is represented as A/D or D/A. This orientation is
currently ignored by this package. [TODO: Allow for oriented fusions.]

Sample names

: Sample names are potentially used in file or directory names, and hence should be limited to only letters, digits, periods, underscores, and hyphens. Sample names should be unique within any cohort or project, and although matched case sensitively, names should be unique when considered in a case-insensitive way to prevent collisions on case-insensitive file systems.

Column headings

: To allow for future additions to data files, column headings are used to identify which column contains data needed. Other columns will be ignored, but the ignored column headings must be parseable. To maximum portability, all column headings, even those ignored, should be assumed to be case sensitive but should be unique within a file when considered case insensitively. Any characters other than letters, digits, periods, underscores, and hyphens should be avoided in column header names - especially spaces. Tabs and returns are not allowed under any circumstances as these are file delimiters.

File names

: File names should in general be full path file names, be considered to be matched case-sensitively, should be unique if considered case insensitively, and should contain only letters, digits, periods, underscores, and hyphens. If a file name is a link or shortcut, it should be followed transparently, so if there are problematic file names, create links to them and use the full-path to the link.

Gene, transcript and exon names

: All gene, transcript, and exon names are taken from the genome annotation file (gaf). There are multiple versions of the gaf, this package uses v2.1.

Required input data

Overview

The following data files are needed by this package. Each is described in more detail in a following section.

This package was initially designed to run as part of a batch processing pipeline. Inputs are based on files rather than variables. [TODO - allow input from vectors, lists, or data-frames in addition to file input.]

Gaf 2.1 (TCGA.hg19.June2011.gaf)

: The gene annotation file (gaf) reference used with the TCGA RNASeq data as uploaded by UNC. Contains descriptions including names and location models for all the genes, transcipts, exons and more. Used as a reference for the exons described in the Exon expression files and to associate gene or transcript names with exon models.

Exon expression files (bt.exon_quantification.txt)

: Each exon expression file describe RNA expression for one sample on a per-exon basis. It is normally downloaded from the TCGA website and is publically available after a short embargo period. It is usally called bt.exon_quantification.txt. The directory it is in is the name of the sample it is associated with.

Sample description file

: This file lists samples and their properties. The only required property is the path to the exon expression file. Extra properties are ignored

Fusion file

: This file lists the fusions in the cohort, by sample. Each line represents a fusion between two positions in the genome in a sample, as indicated by the most probable alignment of one or more RNASeq reads. Fusions as they appear in RNASeq data do not differ much from a long range splice event, except it can involve different chromosomes.

Each fusion end may match to multiple partners, and each such fusion event is

indicated on a separate line. Multiple RNA fusion parnters may or may not indicate multiple fusion events. This can be due to heterogeneity in the underlying DNA fusion events in the source sample, but can also be the result of alternate splicing of transcripts whose unprocessed sequence spans a particular DNA fusion.

Gene annotation file V2.1 (gaf)

[TODO: Fix out of date Links]

This package only works with version 2 of the gaf, specifically the hg19 version from June 2011. This annotation data was used by UNC-LCCC to generate the TCGA RNASeq expression results (V2 - genome alignment based). No dbSNP information is used, so the smaller version of the gaf that does not include the SNP annotations is fine to use. It can be downloaded from the NCI here. You probably want TCGA.hg19.June2011.gaf.gz. The description of this file is unfortunately only included with version 2.0 of the gaf, here. GAF_hg19_Feb2011.docx describes how the gaf was put together and what the conventions are. GAF_v2_file_description.docx describes the columns. The changes relative to version 2.0 are described here.

Exon expression files

The exon expression file format is one sample per file, one exon per line, no heading. Each line is a tab-delimited description in 4 columns of the expression for one (non-overlapping) exon from the sample. Exons are pre-defined by the gaf.

column 1 - exon id

: Each line is a different exon identified by a formated string giving the genomic coordinates of the exon, formated like

    <chr#>:<start>-<end>:<strand>

    <chr#>   - The chromosome name
    <start>  - The lowest genomic position of the exon.
    <end>    - The highest genomic position of the exon.
    <strand> - The orientation of the exon; the strand it is from.

column 2 - count

: The number of reads whose alignment overlaps this exon.

column 3 - percent coverage.

: The fraction of bases in the exon covered by aligned reads.

column 4 - RPKM.

: RPKM = (count) * 1e9 / ((total reads) * (exon length)) where total reads are the number of aligned reads in this sample. Total reads can be calculated from any given line by reversing this equation for a given RPKM; it is constant within a file.

An example file:

    chr1:11874-12227:+  0  0.0000000    0
    chr1:12595-12721:+  2   0.3809524   0.185006777804708
    chr1:12613-12721:+  2   0.2777778   0.215558355790806
    chr1:12646-12697:+  0   0.0000000   0
    chr1:13221-14409:+  31  0.1843434   0.306295914304935
    chr1:13403-14409:+  30  0.2127237   0.349987995747734
    chr1:14363-16765:-  3039    0.9929225   14.8572453004703
    ...
    chrM_rCRS:12337-14644:+ 740004  1.0000000   3766.68781662252
    chrM_rCRS:14761-15887:+ 705755  1.0000000   7356.84171501079
    chrM_rCRS:15998-16182:+ 7884    1.0000000   500.652341618822
    chrM_rCRS:16183-16569:+ 7809    1.0000000   237.053200052163

Calculating total reads from chr1:12595-12721:+ looks like:

    (count * 1E9) / ( RPKM             *  (exon_length)     ) =
    ( 2    * 1E9) / (0.185006777804708 * (1 + 12721 - 12595)) =
    85121376 = total reads

Calculating RPKM for chr1:14363-16765:- looks like:

    (count) * 1e9 / ((total reads) *  (exon length)     ) =
     3039   * 1e9 / (  85,121,376  * (1 + 16765 - 14363)) =
     14.8572453004703... = RPKM

Note that being on the forward or reverse strand has no impact on values or calculations.

Public TCGA data is packaged by cancer type and then within that disease by a hierarchy describing the analysis platform. The details of this organization is somewhat obscure, it is described [TODO][here]. The root for this public data is

https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/

[TODO: Fix out of date Link]

One subdirectory for each disease type is provided, and then within that directory, the path to the RNASeq expression data will look like

Within this directory are multiple versions of the cohort, the most recent one will be the most complete.

Note: for some cancer types, older RNAseq analysis platforms are also available. This data has been superceeded by the platform described above. For some cancer types, total RNASeq expression data is available. Total RNASeq still tries to filter out ribosomal RNA but includes non-polyA-tailed RNA. For a few cancer types, additional or independent RNASeq data will be under different providers (not unc.edu)

The sample description file

The sample description file format is one sample per line, with a header describing the columns. Lines and headers are tab delimited and terminated by newlines. At least one column, "sample", must contain the sample names and another, "exonExpressionFile" the path to the exon expression files for those samples. Other columns are ignored.

sample

: This column gives the sample name, and will be used as a key and possibly a file or directory name. Therefore the strings in this column should be unique when considered case in-sensitively and consist of only letters, numbers, periods, underscores, and hyphens.

exonExpressionFile

: The path to the exon expression file associated with this sample. Generally this is the full path. [TODO: test local paths] URLs are not supported. [TODO: support urls]

An example file looks like:

    sample ExonExpressionFile
    TCGA-QR-A6H4-01A-11R-A35K-07    /extData/expression/140205_UNC12-SN629_0347_AC38LPACXX_7_TGACCA/bt.exon_quantification.txt
    TCGA-QR-A705-01A-11R-A35K-07    /extData/expression/140205_UNC12-SN629_0347_AC38LPACXX_7_GAGTGG/bt.exon_quantification.txt

The fusion data file

The fusion data file describes the RNASeq derived fusions associated with the samples listed in the sample file. It is formatted with one fusion per line and a header line describing the columns. Fusion lines and the headers are tab delimited and terminated by newlines. The following columns must be provided. Other columns are ignored.

sample

: The sample name. Must match the sample name from the cohort definition file exactly (case sensitive).

fusion

: The fusion, formatted like:

     <chrLow>:<posLow><strandLow>~<chrHigh>:<posHigh><strandHigh>

     <chr*> - A chromosome name; chr1 to chr22, chrX, chrY, or chrM_rCRS.
     <pos*> - The last base of a fusion end in + strand coordinates.
     <strand*> - The orientation of an exon; its strand (+,-,*).

     Low and High provide a canonical order:
         If chr* are different, the first in order is Low
         If chr* are the same, then the lower of pos* (+ strand) is Low
         If chr* and pos* are the same, strandLow is +, then -, then *

Set up to run

Assumes working on a Linux system

(1) Set up a working project directory.

    mkdir /home/srj/Projects/Fusions/Pcpg/FusionExpression
    cd /home/srj/Projects/Fusions/Pcpg/FusionExpression

(2) Create the cohort definition file maping samples to quantification files. We have a postgresql database tracking data. [TODO - replace with GDC info...]

    psql -U seqware -h swprod.bioinf.unc.edu seqware_meta_db \
         --field-separator $'\t' --no-align --tuples-only \
         --command "select v.sample, v.file_path \
              from vw_files v, upload u \
              where u.sample_id = v.sample_id and \
                  v.sample_type = 'PCPG' and \
                  v.algorithm = 'bt_exon_quantification' and \
                  (u.target='CGHUB' or u.target='CGHUB_BAM') and \
                  u.external_status = 'live' \
                  and (u.status != 'skip')" \
          -o exonQuantFileList.txt

(3) Verify no sample listed twice:

    cut -f 1 exonQuantFileList.txt | sort | uniq | wc -l
    wc -l exonQuantFileList.txt

(4) Get list of samples in cohort and edit to contain only the samples you want plots for.

    cut -f 1 exonQuantFileList.txt > sampleList.txt

(6) Add heading to exonQuantFileList.txt (tab separated first line)

    sample  exonExpressionFile

(7) Create list of genes to plot (genesList.txt) with one gene per line.

(8) Get Mapsplice cohort fusion list and save locally

Get gene model

(1) Link to gaf file

    ln -s /datastore/tier1data/nextgenseq/seqware-analysis/GAF/TCGA.hg19.June2011.gaf TCGA.hg19.June2011.gaf

(2) In R with this as the working directory

1. Load code

        library( "FusionExpressionPlot" )

2. Set up parameters for running

     One convention used below is to save useful intermediate results as
     .RDS files. This incudes the results of computations that takes a
     while and results that are reasonably independent data objects that
     are used as a parameter elsewhere. These are saved as .RDS files.
     Another convention is to use variables to hold filenames.

        gafFile <- "TCGA.hg19.June2011.gaf"
        gafGeneModelsFile <- "TCGA.hg19.June2011.gaf.geneModels"
        geneModelsDF.RDS <- "geneModelsDF.RDS"

3. Extract gene models from the gaf (takes a few seconds)

        gafGeneModelStats <- extractGeneModels( gafFile, outFile= gafGeneModelsFile )
        print(gafGeneModelStats)
            $gaf
            [1] "TCGA.hg19.June2011.gaf"

            $gaf_real
            [1] "/datastore/tier1data/nextgenseq/seqware-analysis/GAF/TCGA.hg19.June2011.gaf"

            $gaf_md5
                        TCGA.hg19.June2011.gaf
            "b9e0c2b81736d82d62bb6ab8cc517644"

            $uniqueGene
            [1] TRUE

            $skipUnknownGene
            [1] TRUE

            $gaf_lines
            [1] 3483704

            $gaf_models
            [1] 20764

            $gaf_models_unique
            [1] 20502

            $gaf_extract
            [1] "TCGA.hg19.June2011.gaf.geneModels.txt"

            $gaf_extract_real
            [1] "/home/srj/Projects/Fusions/Pcpg/FusionExpression/TCGA.hg19.June2011.gaf.geneModels.txt"

            $gaf_extract_md5
            TCGA.hg19.June2011.gaf.geneModels.txt
               "4e048d450089ee2309dd16c6ce19b898"

  4. Generate the data file of gene models (takes a few seconds)

        geneModelsDF <- loadGafModels( gafGeneModelsFile )
        saveRDS( geneModelsDF, file=geneModelsDF.RDS )

Generate cohort expression data frame

  1. Setup parameters for running

    expressionColumn <- "rpkm"
    geneModelsDF.RDS <- "geneModelsDF.RDS"
    cohortSamplesFile <- "sampleList.txt"
    sampleExpressionFileMap <- "exonQuantFileList.txt"
    cohortExpressionDF.RDS <- "cohortExpressionDF.RDS"
    
  2. Load the gene model data prepared above

    geneModelsDF <- readRDS( geneModelsDF.RDS )
    
  3. Set up the data formats used for processing gene expression files

    cohortSamples <- slurp( cohortSamplesFile )
    sampleExpressionFilesDF <- loadCohortDefinition( sampleExpressionFileMap, cohortSamples )
    
  4. Load data from all cohort expression files, takes about 5 seconds per file

     cohortExpressionDF <- getCohortExonExpressionData( geneModelsDF, sampleExpressionFilesDF, type=expressionColumn )
        [1] "1 of 184. Adding exon expression data for TCGA-P7-A5NY-01A-12R-A35K-07"
        [1] "2 of 184. Adding exon expression data for TCGA-QR-A6H4-01A-11R-A35K-07"
        ...
        [1] "183 of 184. Adding exon expression data for TCGA-WB-A81K-01A-11R-A35L-07"
        [1] "184 of 184. Adding exon expression data for TCGA-S7-A7X0-01A-12R-A35L-07"
     saveRDS( cohortExpressionDF, file= cohortExpressionDF.RDS )
    

Normalize

  1. Set up for normalization

    cohortExpressionDF.RDS <- "cohortExpressionDF.RDS"
    normalizedCohortExpressionDF.RDS <- "normalizedCohortExpressionDF.RDS"
    
  2. Remove any extraneous exons

    dropRows <- cohortExpressionDF$gene == 'TCF4' & cohortExpressionDF$exon == 10
    cohortExpressionModDF <- cohortExpressionDF[! dropRows, ]
    nrow(cohortExpressionDF)
    #=> [1] 224227
    nrow(cohortExpressionModDF)
    #=> [1] 224226
    saveRDS( cohortExpressionModDF, file= "cohortExpressionModDF.RDS" )
    
  3. Do normalization (Takes a couple of seconds per sample)

    normalizedCohortExpressionDF <- normExpressionData( cohortExpression )
    saveRDS( normalizedCohortExpression, file=normalizedCohortExpression.RDS )
    

Select Fusions

  1. Set up for fusion selection

    fusionFile <- "pcpg_185_2014-05-18.merged.fusion_gene_profiles.txt"
    samplesFile <- "sampleList.txt"
    genesFile <- "genesList.txt"
    fusionDataDF.RDS <- "fusionDataDF.RDS"
    selectedFusionsDF.RDS <- "selectedFusionsDF.RDS"
    
  2. Generate data frame of with all fusions

    fusionDataDF <- getMapSpliceCohortFusionData( fusionFile );
    saveRDS( fusionDataDF, file=fusionDataDF.RDS )
    
  3. Generate the fusion subset data frame

    samples <- slurp( samplesFile )
    genesOfInterest <- slurp( genesFile )
    selectedFusionsDF <- filterFusions( fusionDataDF, sample=samples, gene1=genesOfInterest, gene2=genesOfInterest );
    saveRDS( selectedFusionsDF, file=selectedFusionsDF.RDS )
    

Make plots

In the section we will finally generate the plots. This needs two of the data frames generated above: the fusions to plot created in "Select fusions" and the combined exon gene models + normalized exon expression data created in "Normalize".

  1. Set up for plotting - Use the data frames if they exist, or if in a different R session load them from the saved RDS files.

    selectedFusionsDF.RDS <- "selectedFusionsDF.RDS"
    normalizedCohortExpressionDF.RDS <- "normalizedCohortExpressionDF.RDS"
    selectedFusionsDF <- readRDS( selectedFusionsDF.RDS )
    normalizedCohortExpressionDF <- readRDS( normalizedCohortExpressionDF.RDS )
    
  2. Generate the plots

    do.allPlots( selectedFusionsDF, normalizedCohortExpressionDF )
    
  3. That's it. You can quit R without saving the environment if you want because you have all the important data saved as RDS files. You probably want to move the plots to a directory and zip them up for storage if you generated a bunch:

  4. [Optional] Move the plots to a directory and compress them

    mkdir myPLots mv *.pdf myPlots/ tar -czvf myPlots.tgz myPlots/



jefferys/FusionExpressionPlot documentation built on May 19, 2019, 3:59 a.m.