assign_splice_sites: Assign intron donor and acceptor splice sites consensus

View source: R/introns.R

assign_splice_sites    R Documentation

Assign intron donor and acceptor splice sites consensus

Description

This function takes a data frame of intron coordinates and a genome sequence (ideally human or mouse) and returns the same data frame with two additional columns holding the donor and acceptor splice site sequences. The sequences are extracted from the specified genome (e.g., human hg38) at positions derived from the intron coordinates and strand, which makes the output useful for downstream analysis of splicing events.

Usage

assign_splice_sites(input, genome = BSgenome.Hsapiens.UCSC.hg38, verbose = TRUE)

Arguments

input

A data frame containing intron coordinates with the following columns:

  • seqnames: The chromosome name.

  • intron_start: The start position of the intron.

  • intron_end: The end position of the intron.

  • strand: The strand on which the intron is located (+ or -).

  • transcript_id: The ID of the transcript to which the intron belongs.

  • intron_number: The number of the intron within the transcript.

  • gene_name: The name of the gene.

  • gene_id: The gene ID.
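As an illustration of the expected layout, the following sketch builds a minimal input data frame by hand and checks that all documented columns are present. Only the column names come from the list above; the coordinates, IDs, column types, and the names required_cols and missing_cols are hypothetical and chosen purely for the example.

```r
# Hypothetical example values; only the column names are taken from the documentation.
introns_df <- data.frame(
  seqnames      = c("chr1", "chr1"),
  intron_start  = c(1001L, 5001L),
  intron_end    = c(2000L, 6000L),
  strand        = c("+", "-"),
  transcript_id = c("TX_EXAMPLE_1", "TX_EXAMPLE_2"),
  intron_number = c(1L, 1L),
  gene_name     = c("GENE_A", "GENE_B"),
  gene_id       = c("GENE_ID_A", "GENE_ID_B"),
  stringsAsFactors = FALSE
)

# Check that every documented column is present before passing the data on.
required_cols <- c("seqnames", "intron_start", "intron_end", "strand",
                   "transcript_id", "intron_number", "gene_name", "gene_id")
missing_cols <- setdiff(required_cols, names(introns_df))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```

In practice this data frame would normally come from extract_introns rather than being typed by hand.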

genome

A BSgenome object holding the genome sequence for the species. Default is the human genome (hg38). The splice site sequences are extracted from this object at the specified intron positions, so its chromosome names must match those in seqnames.

verbose

Logical. If TRUE, the function prints progress messages while preparing the splice site data. Default is TRUE.

Details

This function performs the following steps:

  • First, it calculates the positions of the donor and acceptor splice sites from the intron coordinates and the strand orientation. The donor splice site is at the 5' end of the intron and the acceptor splice site is at the 3' end, so which coordinate marks each site depends on the strand.

  • The function then uses the getSeq function from the BSgenome package to extract the nucleotide sequences for both donor and acceptor sites from the specified genome (default is hg38 for humans).

  • The resulting sequences are added as new columns (donor_ss and acceptor_ss) to the original input data frame.

  • The final data frame includes the splice site sequences for each intron, allowing for analysis of splicing efficiency or identification of consensus motifs.
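The strand logic in the steps above can be sketched in base R. This is not the package's implementation: it assumes a 2-nucleotide window at each intron end (the width actually used by assign_splice_sites is not stated here), assumes intron_start <= intron_end, and uses a named character vector (toy_genome) in place of a BSgenome object and getSeq. The helpers revcomp and extract_ss are hypothetical names.

```r
# Toy "genome": a named character vector standing in for a BSgenome object (assumption).
toy_genome <- c(chr1 = paste0(
  "AAAA",   # positions 1-4
  "GT",     # 5-6:   donor of the + strand intron
  "CCCCC",  # 7-11
  "AG",     # 12-13: acceptor of the + strand intron
  "TTTTT",  # 14-18
  "CT",     # 19-20: acceptor of the - strand intron (reverse complement of AG)
  "AAAAA",  # 21-25
  "AC",     # 26-27: donor of the - strand intron (reverse complement of GT)
  "GGG"     # 28-30
))

# Reverse complement of each string in a character vector.
revcomp <- function(x) {
  vapply(x, function(s) {
    paste(rev(strsplit(chartr("ACGT", "TGCA", s), "")[[1]]), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: add donor_ss and acceptor_ss using a fixed window width.
extract_ss <- function(df, genome, width = 2L) {
  plus <- df$strand == "+"
  # On +, the donor is at intron_start; on -, it is at the last `width` bases.
  donor_start    <- ifelse(plus, df$intron_start, df$intron_end - width + 1L)
  acceptor_start <- ifelse(plus, df$intron_end - width + 1L, df$intron_start)
  chrom_seq <- unname(genome[df$seqnames])
  donor    <- substring(chrom_seq, donor_start, donor_start + width - 1L)
  acceptor <- substring(chrom_seq, acceptor_start, acceptor_start + width - 1L)
  # Minus-strand sequences are reported 5' to 3' on their own strand.
  df$donor_ss    <- ifelse(plus, donor, revcomp(donor))
  df$acceptor_ss <- ifelse(plus, acceptor, revcomp(acceptor))
  df
}

toy_introns <- data.frame(
  seqnames     = c("chr1", "chr1"),
  intron_start = c(5L, 19L),
  intron_end   = c(13L, 27L),
  strand       = c("+", "-"),
  stringsAsFactors = FALSE
)

toy_result <- extract_ss(toy_introns, toy_genome)
print(toy_result[, c("strand", "donor_ss", "acceptor_ss")])
# Both toy introns yield donor_ss "GT" and acceptor_ss "AG".
```

Reverse-complementing the minus-strand sequences keeps the two columns comparable across strands, which is what makes motif counting straightforward.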

Value

A data frame containing the original intron data, with two additional columns:

  • donor_ss: The donor splice site consensus sequence for each intron.

  • acceptor_ss: The acceptor splice site consensus sequence for each intron.
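One way the two columns might be used downstream is to tabulate donor/acceptor combinations. The sketch below works on a small hand-made data frame (toy_result, hypothetical) and assumes that each column holds a dinucleotide string; if the function returns longer windows, the comparison against "GT" and "AG" would need adjusting.

```r
# Hand-made stand-in for the output; values are illustrative, not real results.
toy_result <- data.frame(
  donor_ss    = c("GT", "GT", "GC", "GT"),
  acceptor_ss = c("AG", "AG", "AG", "AC"),
  stringsAsFactors = FALSE
)

# Count each donor-acceptor combination.
pair_counts <- table(paste(toy_result$donor_ss, toy_result$acceptor_ss, sep = "-"))
print(pair_counts)

# Fraction of introns carrying the canonical GT-AG pair.
canonical_fraction <- mean(toy_result$donor_ss == "GT" &
                           toy_result$acceptor_ss == "AG")
print(canonical_fraction)
```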

See Also

extract_introns, find_cryptic_splice_sites

Examples

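# Load the package and the reference genome
library(GencoDymo2)
suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg38))
# Read the example GTF shipped with the package and extract its introns
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
introns_df <- extract_introns(gtf_v1)
# Append the donor_ss and acceptor_ss columns
result <- assign_splice_sites(introns_df, genome = BSgenome.Hsapiens.UCSC.hg38)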


GencoDymo2 documentation built on June 8, 2025, 10:29 a.m.