clean_data: Retrieve, Clean, and Format Input Data

View source: R/clean_data.R

clean_data    R Documentation

Retrieve, Clean, and Format Input Data

Description

This function cleans and formats input data. Cleaning and formatting involve removing any non-protein-coding transcripts, removing any principal transcripts, and standardizing all column names. If the sequence is provided directly, the function also extracts the APPRIS annotation and UniProt IDs of each transcript from Ensembl. Input data can follow one of two formats: the first contains only transcript IDs and gene names, and the second contains a unique transcript identifier, gene names, and amino acid sequences. The function returns a data frame containing the transcript ID, gene name, and APPRIS annotation for each input transcript. If amino acid sequences are included in the input data, they are also included in the data frame. If only gene names and transcript IDs are provided, UniProt IDs are included in the data frame.
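The three cleaning steps named above (dropping non-protein-coding transcripts, dropping principal transcripts, and standardizing column names) can be sketched on a toy data frame in base R. This is only an illustration of the idea, not the package's implementation: the column names, the biotype labels, and the APPRIS values below are all assumptions made for the example.

```r
# Toy input. Column names and values are illustrative assumptions,
# not the format that clean_data() requires or produces.
raw <- data.frame(
  Gene.Name     = c("Crb1", "Crb1", "Crb1", "Abc1"),
  Transcript.ID = c("T1", "T2", "T3", "T4"),
  Biotype       = c("protein_coding", "protein_coding",
                    "retained_intron", "protein_coding"),
  APPRIS        = c("principal1", "alternative1", NA, "alternative2"),
  stringsAsFactors = FALSE
)

# Step 1: keep only protein-coding transcripts.
coding <- raw[raw$Biotype == "protein_coding", ]

# Step 2: drop principal transcripts (annotation starts with "principal").
is_principal <- !is.na(coding$APPRIS) & grepl("^principal", coding$APPRIS)
alternative <- coding[!is_principal, ]

# Step 3: standardize column names (lower case, dots to underscores).
names(alternative) <- tolower(gsub("\\.", "_", names(alternative)))

# Keep the columns of interest and reset the row names.
cleaned <- alternative[, c("gene_name", "transcript_id", "appris")]
rownames(cleaned) <- NULL

print(cleaned)
```

In this toy case only the two alternative, protein-coding transcripts (T2 and T4) remain.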

Usage

clean_data(data_file, if_aa, organism)

Arguments

data_file

Path to the input file

if_aa

Boolean value indicating whether the input file contains amino acid sequences: TRUE if sequences are present, FALSE if only IDs are present

organism

String indicating whether the transcripts are from human or mouse

Value

A data frame containing gene names, transcript IDs, and APPRIS annotations for the given data. If sequences were provided, the data frame also contains the amino acid sequences. If only IDs were provided, the data frame also contains the UniProt Swiss-Prot ID, UniProt Swiss-Prot isoform ID, and UniProt TrEMBL ID.
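Examples

A call might look like the following sketch for an input file that contains only transcript IDs and gene names. The file path is a hypothetical placeholder, and the value "mouse" for organism is an assumption about the accepted strings; check the package itself for the exact values. The call is guarded so that the snippet does nothing harmful when surfaltr is not installed or the file does not exist. Since annotations are extracted from Ensembl, a network connection is presumably needed.

```r
# Hypothetical path to an input file with transcript IDs and gene names.
input_path <- "path/to/transcripts.csv"

if (requireNamespace("surfaltr", quietly = TRUE) && file.exists(input_path)) {
  cleaned <- surfaltr::clean_data(
    data_file = input_path,
    if_aa     = FALSE,    # the file holds IDs only, no amino acid sequences
    organism  = "mouse"   # assumed value; confirm the accepted strings
  )
  # Inspect which columns came back and the first few rows.
  print(names(cleaned))
  print(head(cleaned))
} else {
  message("surfaltr is not installed or the input file is missing; skipping the call.")
}
```

With if_aa = TRUE, the same call would instead expect a file that also carries amino acid sequences.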


EliLillyCo/surfaltr documentation built on May 3, 2022, 10:12 a.m.