protein: Protein Secondary Structure Data

protein    R Documentation

Protein Secondary Structure Data

Description

This dataset contains protein sequences and their corresponding secondary structures, with each position labeled as beta-sheet (E), helix (H), or coil (_).

Usage

protein

Format

A data frame of protein sequences and their secondary structures, with the following columns:

  • Sequence: Amino acid sequence (using 3-letter codes).

  • Structure: Secondary structure of the protein (E for beta-sheet, H for helix, _ for coil).

  • Parameters: Additional parameters for neural networks (to be ignored).

  • Biophysical_Constants: Biophysical constants (to be ignored).
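As a minimal sketch of how the Structure labels might be summarized, the snippet below counts how often each class occurs. It runs on a small invented data frame (toy) rather than on protein itself: the column names are taken from the list above, but the values and the one-residue-per-row layout are assumptions that may not match the real data.

```r
# Toy stand-in for `protein`: column names follow the Format section,
# but every value below is invented for illustration.
toy <- data.frame(
  Sequence  = c("Gly", "Val", "Ala", "Ser", "Leu", "Lys"),
  Structure = c("_", "E", "E", "H", "H", "_"),
  stringsAsFactors = FALSE
)

# Count residues per secondary-structure class, keeping all three
# documented labels even if one happens to be absent.
class_counts <- table(factor(toy$Structure, levels = c("E", "H", "_")))
print(class_counts)

# Convert the counts to proportions of the whole.
class_props <- prop.table(class_counts)
print(round(class_props, 2))
```

Fixing the factor levels up front keeps the table shape stable across subsets, which is convenient when comparing class balance between groups of proteins.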

Details

The dataset is intended for predicting protein secondary structure from amino acid sequence. The first few numbers in each sequence are neural network parameters and should be ignored. The '<' symbol acts as a spacer between proteins, marking the beginning and end of each sequence.
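The role of the '<' spacer can be illustrated with a short sketch that splits a stream of residue tokens into one vector per protein. The tokens vector is hypothetical and only mimics the spacer convention described above; how the spacers are actually stored in protein is not specified here, so treat the layout as an assumption.

```r
# Hypothetical token stream: '<' opens and closes each protein.
# The actual storage layout in `protein` may differ.
tokens <- c("<", "Gly", "Val", "Ala", "<", "<", "Ser", "Leu", "<")

# Flag the spacer positions.
is_spacer <- tokens == "<"

# A running count of spacers gives every residue the id of the most
# recent spacer before it, so residues of one protein share an id.
group <- cumsum(is_spacer)

# Drop the spacers and split the remaining residues by group id.
residues <- tokens[!is_spacer]
proteins <- unname(split(residues, group[!is_spacer]))

print(length(proteins))
print(proteins)
```

Consecutive spacers (one closing a protein, the next opening another) produce no empty groups, because only ids that still carry residues survive the split.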

Note

The biophysical constants included in the dataset were found to be unhelpful and are generally ignored in analysis.
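One way to set the ignored columns aside before analysis is sketched below. It uses an invented data frame (toy) whose column names follow the Format section; the values are made up, and the assumption that protein exposes exactly these column names should be checked with names(protein) first.

```r
# Toy stand-in for `protein`; all values are invented for illustration.
toy <- data.frame(
  Sequence              = c("Gly", "Val"),
  Structure             = c("_", "E"),
  Parameters            = c(1, 2),
  Biophysical_Constants = c(0.1, 0.2),
  stringsAsFactors = FALSE
)

# Columns the documentation says to ignore.
ignored <- c("Parameters", "Biophysical_Constants")

# Keep every column that is not in the ignored set.
# drop = FALSE guarantees the result stays a data frame.
analysis <- toy[, setdiff(names(toy), ignored), drop = FALSE]

print(names(analysis))
```

Selecting by setdiff rather than by position keeps the code working even if the column order differs from the order listed in the Format section.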

Source

Vince G. Sigillito, Applied Physics Laboratory, Johns Hopkins University.

Examples

# Load the dataset
data(protein)

# Print the first few rows of the dataset
print(head(protein))

LFM documentation built on April 16, 2025, 9:07 a.m.
