R/get_fasta_from_rcsb_entry.R

Defines functions get_fasta_from_rcsb_entry

Documented in get_fasta_from_rcsb_entry

#' Retrieve FASTA Sequence from PDB Entry or Specific Chain
#'
#' This function retrieves FASTA sequences from the RCSB Protein Data Bank (PDB) for a specified entry ID (\code{rcsb_id}). It can return either the full set of sequences associated with the entry or, if specified, the sequence corresponding to a particular chain within that entry. This flexibility makes it a useful tool for bioinformaticians and structural biologists needing access to protein or nucleic acid sequences.
#'
#' @param rcsb_id A string representing the PDB ID for which the FASTA sequence is to be retrieved. This is the primary identifier of the entry in the PDB database.
#' @param chain_id A string representing the specific chain ID within the PDB entry for which the FASTA sequence is to be retrieved. If \code{chain_id} is NULL (the default), the function will return all sequences associated with the entry. The chain ID should match one of the chain identifiers in the PDB entry (e.g., "A", "B").
#' @param verbosity A boolean flag indicating whether to print status messages during the function execution. When set to \code{TRUE} (the default), the function will output messages detailing the progress and any issues encountered.
#' @param fasta_base_url A string representing the base URL for the FASTA retrieval. By default, this is set to the global constant \code{FASTA_BASE_URL}, but users can specify a different URL if needed.
#'
#' @return
#' * If \code{chain_id} is NULL, the function returns a list of FASTA sequences associated with the provided \code{rcsb_id}, where organism names or chain descriptions are used as keys.
#' * If \code{chain_id} is specified, the function returns a character string representing the FASTA sequence for that specific chain.
#' * If the specified \code{chain_id} is not found in the PDB entry, the function will stop execution with an informative error message.
#'
#' @details
#' The function queries the RCSB PDB database using the provided entry ID (\code{rcsb_id}) and optionally a chain ID (\code{chain_id}). It sends an HTTP GET request to retrieve the corresponding FASTA file. The response is then parsed into a list of sequences. If a chain ID is provided, the function will return only the sequence corresponding to that chain. If no chain ID is provided, all sequences are returned.
#'
#' If a request fails, the function provides informative error messages. In the case of a network failure, the function will stop execution with a clear error message. Additionally, if the chain ID does not exist within the entry, the function will return an appropriate error message indicating that the chain was not found.
#'
#' The function also supports passing a custom base URL for the FASTA file retrieval, providing flexibility for users working with different PDB mirrors or services.
#'
#' @examples
#' # Example 1: Retrieve all FASTA sequences for the entry 4HHB
#' all_sequences <- get_fasta_from_rcsb_entry("4HHB", verbosity = TRUE)
#' print(all_sequences)
#'
#' # Example 2: Retrieve the FASTA sequence for chain A of entry 4HHB
#' chain_a_sequence <- get_fasta_from_rcsb_entry("4HHB", chain_id = "A", verbosity = TRUE)
#' print(chain_a_sequence)
#'
#' @importFrom httr GET http_status content
#' @export


get_fasta_from_rcsb_entry <- function(rcsb_id, chain_id = NULL, verbosity = TRUE, fasta_base_url = FASTA_BASE_URL) {
  if (verbosity) {
    message(paste0("Querying RCSB for the '", rcsb_id, "' FASTA file."))
  }

  # Send request to the base URL
  response <- tryCatch(
    {
      GET(paste0(fasta_base_url, rcsb_id))
    },
    error = function(e) {
      stop("Failed to retrieve data from the RCSB PDB. Network error: ", e$message)
    }
  )

  # Check for successful response
  if (http_status(response)$category != "Success") {
    stop("Request failed with status code ", http_status(response)$status, ": ", content(response, "text", encoding = "UTF-8"))
  }

  # Parse the FASTA text
  fasta_sequences <- tryCatch(
    {
      parse_fasta_text_to_list(content(response, "text", encoding = "UTF-8"))
    },
    error = function(e) {
      stop("Failed to parse FASTA response from RCSB PDB. The response may not be in the expected format. Error: ", e$message)
    }
  )

  if (is.null(chain_id)) {
    if (length(fasta_sequences) == 0) {
      stop("No FASTA sequences were found for the entry ID '", rcsb_id, "'.")
    }
    return(fasta_sequences)
  }

  # Find and return the sequence for the specified chain ID
  for (header in names(fasta_sequences)) {
    if (grepl(paste0("\\b", chain_id, "\\b"), header)) {
      chain_sequence <- fasta_sequences[[header]]
      return(chain_sequence)
    }
  }

  stop(paste0("Chain ID '", chain_id, "' not found in PDB entry '", rcsb_id ,". Please check the chain ID and try again."))
}

Try the rPDBapi package in your browser

Any scripts or data that you put into this service are public.

rPDBapi documentation built on Sept. 11, 2024, 6:37 p.m.