analyzePFAM: Import Result of PFAM analysis
In IsoformSwitchAnalyzeR: Identify, Annotate and Visualize Alternative Splicing and Isoform Switches with Functional Consequences from both short- and long-read RNA-seq data.


View source: R/analyze_external_sequence_analysis.R

Allows for easy integration of the result of Pfam (external sequence analysis of protein domains) in the IsoformSwitchAnalyzeR workflow. Please note that due to the 'removeNoncodinORFs' option in analyzeCPAT and analyzeCPC2 we recommend using analyzeCPC2/analyzeCPAT before using analyzePFAM, analyzeNetSurfP2 and analyzeSignalP if you have predicted the ORFs with analyzeORF.

analyzePFAM(
    switchAnalyzeRlist,
    pathToPFAMresultFile,
    showProgress=TRUE,
    quiet=FALSE
)

`switchAnalyzeRlist`	A `switchAnalyzeRlist` object
`pathToPFAMresultFile`	A string indicating the full path to the Pfam result file(s). If multiple result files were created (multiple web-server runs), supply all the paths as a vector of strings. See `details` for suggestions on how to run the Pfam tool and obtain its result.
`showProgress`	A logical indicating whether to show a progress bar (if TRUE) or not (if FALSE). Default is TRUE.
`quiet`	A logical indicating whether to avoid printing progress messages (incl. progress bar). Default is FALSE.
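
When several web-server runs produced several result files, the vector for `pathToPFAMresultFile` can be assembled with base R. The sketch below is only an illustration: the directory and the file names (`pfam_run1.txt`, `pfam_run2.txt`) are made-up placeholders, created in a temporary folder so the snippet runs on its own.

```r
# Placeholder directory and file names; substitute your own Pfam result files.
resultDir <- file.path(tempdir(), "pfam_results")
dir.create(resultDir, showWarnings = FALSE)
file.create(file.path(resultDir, c("pfam_run1.txt", "pfam_run2.txt")))

# Collect every result file into one character vector of full paths.
pfamFiles <- list.files(
    resultDir,
    pattern    = "^pfam_run[0-9]+\\.txt$",
    full.names = TRUE
)

# Fail early if a path is wrong, before handing the vector on.
stopifnot(length(pfamFiles) > 0, all(file.exists(pfamFiles)))

# The vector would then be supplied as the pathToPFAMresultFile argument, e.g.
# analyzePFAM(switchAnalyzeRlist = mySwitchList, pathToPFAMresultFile = pfamFiles)
# where mySwitchList is a hypothetical name for your own switchAnalyzeRlist.
```

Checking the paths before the call is a design choice of this sketch, not a requirement stated by the function.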

A protein domain is a part of a protein which by itself can maintain a fixed three-dimensional structure. Protein domains are found in most proteins and usually have a specific function.

The Pfam webserver is quite strict with regard to the number of sequences in uploaded files, so we suggest multiple runs, each with one of the files containing a subset of the sequences. See extractSequence for info on how to split the amino acid fasta files.

Notes for how to run the external tools:
Use default parameters. If you want to use the webserver it is easily done as follows:
1) Go to https://www.ebi.ac.uk/Tools/hmmer/search/hmmscan
2) Switch to the "Upload a File" tab.
3) Upload the amino acid file (_AA) created with extractSequence and add your email address. This is important because there is currently no way of downloading the web output, so you need the result sent to your email.
4) Check that Pfam is selected in the "HMM database" window.
5) Submit your job.
6) Wait until you receive the email with the result (usually quite fast).
7) Copy/paste the result part of the email (ONLY what is below the line starting with "seq id") into an empty plain text document (Notepad, Sublime Text, TextEdit or similar, not Word).
8) Save the document and supply the path to that document to analyzePFAM().
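
If the whole email was saved to a text file, the copy/paste in step 7 could also be scripted. The following is a minimal sketch that takes the wording of step 7 literally (keep only what is below the line starting with "seq id"). `extractPfamResultPart()` is a hypothetical helper, not part of IsoformSwitchAnalyzeR; the exact layout of the header line is an assumption, and the example input is a made-up stand-in, not real Pfam output.

```r
# Hypothetical helper (not part of IsoformSwitchAnalyzeR): keep only the lines
# below the first line that starts with "seq id". Leading "#", "<" or
# whitespace before "seq id" is tolerated; that tolerance is an assumption.
extractPfamResultPart <- function(emailLines) {
    headerIndex <- grep("^[#<[:space:]]*seq id", emailLines)
    if (length(headerIndex) == 0) {
        stop("No line starting with 'seq id' was found.")
    }
    firstResultLine <- headerIndex[1] + 1
    if (firstResultLine > length(emailLines)) {
        return(character(0))
    }
    emailLines[firstResultLine:length(emailLines)]
}

# Made-up stand-in for a saved email; NOT real Pfam output.
emailLines <- c(
    "Some greeting text",
    "seq id  <remaining column headers>",
    "<result line 1>",
    "<result line 2>"
)
resultPart <- extractPfamResultPart(emailLines)

# Save the trimmed text as a plain text file (steps 7 and 8).
resultFile <- file.path(tempdir(), "pfam_result_trimmed.txt")
writeLines(resultPart, resultFile)
```

The path in `resultFile` is what would then be supplied to analyzePFAM().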

To run PFAM locally you should use the pfam_scan.pl script as described in the readme at ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/ and supply the path to the result file to analyzePFAM().

Protein domains are only added to isoforms annotated as having an ORF, even if other isoforms exist in the file. This means that if you quantify the same isoforms many times, you can run Pfam once on all isoforms and then supply the entire file to analyzePFAM().

Please note that the analyzePFAM() function will automatically only import the Pfam results from the isoforms stored in the switchAnalyzeRlist - even if many more are stored in the result file.

A column called 'domain_identified' is added to isoformFeatures containing a binary indication (yes/no) of whether a transcript contains any protein domains or not. Furthermore, the data.frame 'domainAnalysis' is added to the switchAnalyzeRlist, containing the details about domain name(s) and position for each transcript (where domain(s) were found).

The data.frame added has one row per isoform and contains the columns:

isoform_id: The name of the isoform analyzed. Matches the 'isoform_id' entry in the 'isoformFeatures' entry of the switchAnalyzeRlist
orf_aa_start: The start coordinate given as amino acid position (of the ORF).
orf_aa_end: The end coordinate given as amino acid position (of the ORF).
hmm_acc: The ID which Pfam has given to the domain.
hmm_name: The name of the domain.
clan: The clan to which the domain belongs.
transcriptStart: The transcript coordinate of the start of the domain.
transcriptEnd: The transcript coordinate of the end of the domain.
pfamStarExon: The exon index in which the start of the domain is located.
pfamEndExon: The exon index in which the end of the domain is located.
pfamStartGenomic: The genomic coordinate of the start of the domain.
pfamEndGenomic: The genomic coordinate of the end of the domain.

Furthermore, depending on the exact tool used (local vs web-server), additional columns are added with information such as E-value and type.
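
The two additions described above can be inspected directly on the returned object. The sketch below reuses the example data and result file shipped with the package and only runs when IsoformSwitchAnalyzeR is installed; which additional columns appear depends on the tool used, as noted above.

```r
# Runs only when IsoformSwitchAnalyzeR is installed; otherwise nothing happens.
if (requireNamespace("IsoformSwitchAnalyzeR", quietly = TRUE)) {
    library(IsoformSwitchAnalyzeR)
    data("exampleSwitchListIntermediary")

    exampleSwitchListAnalyzed <- analyzePFAM(
        switchAnalyzeRlist   = exampleSwitchListIntermediary,
        pathToPFAMresultFile = system.file(
            "extdata/pfam_results.txt",
            package = "IsoformSwitchAnalyzeR"
        ),
        showProgress = FALSE,
        quiet        = TRUE
    )

    # Tabulate the yes/no indicator added to isoformFeatures.
    print(table(
        exampleSwitchListAnalyzed$isoformFeatures$domain_identified,
        useNA = "ifany"
    ))

    # Look at the first rows of the domain details.
    print(head(exampleSwitchListAnalyzed$domainAnalysis))
}
```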

Kristoffer Vitting-Seerup

This function : Vitting-Seerup et al. The Landscape of Isoform Switches in Human Cancers. Mol. Cancer Res. (2017).
Pfam : Finn et al. The Pfam protein families database. Nucleic Acids Research (2014) Database Issue 42:D222-D230

createSwitchAnalyzeRlist
extractSequence
analyzeCPAT
analyzeSignalP
analyzeNetSurfP2
analyzeSwitchConsequences

### Load example data (matching the result files also stored in IsoformSwitchAnalyzeR)
data("exampleSwitchListIntermediary")
exampleSwitchListIntermediary

### Add PFAM analysis
exampleSwitchListAnalyzed <- analyzePFAM(
    switchAnalyzeRlist   = exampleSwitchListIntermediary,
    pathToPFAMresultFile = system.file("extdata/pfam_results.txt", package = "IsoformSwitchAnalyzeR"),
    showProgress=FALSE
    )

exampleSwitchListAnalyzed