genomicLocsToProteinSequence: Obtaining the protein sequences and DNA sequences of the...


genomicLocsToProteinSequence    R Documentation

Obtaining the protein sequences and DNA sequences of the coding regions within a list of genomic loci

Description

genomicLocsToProteinSequence takes a list of genomic loci as input and tries to find the protein sequences and DNA sequences of the coding regions of the genome that lie within those loci.

Usage

genomicLocsToProteinSequence(inputLoci, CDSaaFile)

Arguments

inputLoci

A data frame containing the input genomic loci, one locus per row. The first column gives the chromosome, the 2nd and 3rd columns give the start and end coordinates of the locus on that chromosome, and the 4th column gives the strand ("+" for the forward strand, "-" for the reverse strand). Any further columns are optional and are not used by the function. Chromosome names can follow either the ENSEMBL style (1, 2, 3, ..., X, Y and MT) or the other popular style (chr1, chr2, chr3, ..., chrX, chrY and chrM), but the two styles cannot be mixed within the input of one function call.

CDSaaFile

The data file generated by the package's function generatingCDSaaFile. It contains the genomic locations, DNA sequences and protein sequences of all coding regions in the specific genome used in your analysis.
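
As an illustration of the inputLoci layout described above, the following is a minimal sketch in base R. The column names, the coordinates and the helper hasSingleChrStyle are assumptions made for this example only; they are not part of geno2proteo, which reads the first four columns by position.

```r
# Hypothetical loci: the coordinates are illustrative only and are not
# taken from the package data. Column names are arbitrary because the
# function uses the first four columns by position.
inputLoci <- data.frame(
    chr    = c("16", "16", "16"),
    start  = c(3070313L, 3071779L, 3072384L),
    end    = c(3070389L, 3071855L, 3072460L),
    strand = c("+", "+", "-"),
    stringsAsFactors = FALSE
)

# Assumed helper (not a geno2proteo function): TRUE when every chromosome
# name follows one style, either all "chr"-prefixed or all unprefixed.
hasSingleChrStyle <- function(chrNames) {
    prefixed <- grepl("^chr", chrNames)
    all(prefixed) || !any(prefixed)
}

# Check the two constraints stated in the argument description.
stopifnot(hasSingleChrStyle(inputLoci$chr))
stopifnot(all(inputLoci$strand %in% c("+", "-")))
```

Running such checks before the call is optional; they only catch mixed chromosome naming styles and unexpected strand symbols early.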

Value

A data frame containing the original genomic loci specified in the input, followed by five added columns that give the protein sequence and the DNA sequence of the coding regions within each locus:

  • Column "transId" lists the ENSEMBL IDs of the transcripts whose coding regions overlap with the specified locus and whose overlapping coding regions are exactly the same.

  • Column "dnaSeq" contains the DNA sequence in the overlapping coding regions.

  • Column "dnaBefore" contains the DNA letters that are in the same codon as the first letter of the DNA sequence in column "dnaSeq".

  • Column "dnaAfter" contains the DNA letters that are in the same codon as the last letter of the DNA sequence in column "dnaSeq".

  • Column "pepSeq" contains the protein sequence translated from the DNA sequences in the three preceding columns, "dnaBefore", "dnaSeq" and "dnaAfter".
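
The relation between the three DNA columns and "pepSeq" can be sketched with a toy example in base R. This is only an illustration under stated assumptions: the sequences are invented, the concatenation order dnaBefore, dnaSeq, dnaAfter is assumed from the column descriptions above, and translateToy with its four-entry codon table is a hypothetical helper, not the translation routine used by the package.

```r
# Toy codon table covering only the codons used below (assumption for
# illustration; a real translation needs the full genetic code).
codonTable <- c(ATG = "M", GCC = "A", AAA = "K", TGG = "W")

# Hypothetical helper: split a DNA string into codons and look each one up.
translateToy <- function(dna) {
    stopifnot(nchar(dna) %% 3 == 0)
    starts <- seq(1, nchar(dna), by = 3)
    codons <- substring(dna, starts, starts + 2)
    paste(codonTable[codons], collapse = "")
}

# Invented values: the locus starts and ends in the middle of a codon.
dnaBefore <- "AT"      # letters completing the first codon
dnaSeq    <- "GGCCAA"  # DNA inside the locus
dnaAfter  <- "A"       # letter completing the last codon

# Joining the three pieces gives whole codons that can be translated.
fullDna <- paste0(dnaBefore, dnaSeq, dnaAfter)
pepSeq  <- translateToy(fullDna)
```

In this sketch fullDna is "ATGGCCAAA" and pepSeq is "MAK"; the flanking letters are what make the joined sequence a whole number of codons.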

Author(s)

Yaoyong Li

Examples


    # Locate the example data shipped with the package
    dataFolder = system.file("extdata", package="geno2proteo")
    inputFile_loci=file.path(dataFolder, 
        "transId_pfamDomainStartEnd_chr16_Zdomains_22examples_genomicPos.txt")
    CDSaaFile=file.path(dataFolder, 
        "Homo_sapiens.GRCh37.74_chromosome16_35Mlong.gtf.gz_AAseq.txt.gz")

    # Read the tab-separated genomic loci into a data frame
    inputLoci = read.table(inputFile_loci, sep="\t", stringsAsFactors=FALSE)

    # Obtain the DNA and protein sequences of the coding regions in each locus
    proteinSeq = genomicLocsToProteinSequence(inputLoci=inputLoci, 
                                            CDSaaFile=CDSaaFile)


geno2proteo documentation built on June 13, 2022, 5:08 p.m.