R/data_fetcher.R

#' Fetch RCSB PDB Data Based on Specified Criteria
#'
#' The `data_fetcher` function provides a flexible way to access data from the RCSB Protein Data Bank (PDB). By specifying an identifier, data type, and a set of properties, users can tailor the data retrieval process to meet their specific research needs. The function integrates several steps, including validating IDs, generating a JSON query, fetching the data, and formatting the response.
#'
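#' @param id A single identifier or a character vector of identifiers for the data to be fetched. These IDs correspond to entries, assemblies, polymer entities, or other objects within the RCSB PDB. Each ID must match the data type you are querying (e.g., a PDB ID such as \code{"4HHB"} for entries, an entity ID such as \code{"4HHB_1"} for polymer entities, or a chemical component ID such as \code{"NAG"} for chemical components). Must not be \code{NULL} or empty.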
#'
#' @param data_type A string specifying the type of data to fetch. The available options for \code{data_type} include:
#' \describe{
#'   \item{\code{"ENTRY"}}{Fetches data related to PDB entries. This is the default option and returns basic entry-level information.}
#'   \item{\code{"ASSEMBLY"}}{Fetches data related to biological assemblies. The IDs should be formatted as \code{"PDB_ID-ASSEMBLY_ID"}.}
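#'   \item{\code{"POLYMER_ENTITY"}}{Fetches data related to polymer entities, that is, distinct polymer molecules such as proteins or nucleic acids. The IDs should be formatted as \code{"PDB_ID_ENTITY_ID"} (for example \code{"4HHB_1"}).}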
#'   \item{\code{"BRANCHED_ENTITY"}}{Fetches data related to branched entities, such as carbohydrates or other branched molecules.}
#'   \item{\code{"NONPOLYMER_ENTITY"}}{Fetches data related to non-polymeric entities, such as ligands, cofactors, or small molecules.}
#'   \item{\code{"POLYMER_ENTITY_INSTANCE"}}{Fetches data related to specific instances of polymeric entities, also known as chains.}
#'   \item{\code{"BRANCHED_ENTITY_INSTANCE"}}{Fetches data related to specific instances of branched entities.}
#'   \item{\code{"NONPOLYMER_ENTITY_INSTANCE"}}{Fetches data related to specific instances of non-polymeric entities.}
#'   \item{\code{"CHEMICAL_COMPONENT"}}{Fetches data related to chemical components, such as ligands, small molecules, and monomers (for example amino acids or nucleotides). The IDs are chemical component IDs such as \code{"NAG"}.}
#' }
#' Each \code{data_type} corresponds to a specific hierarchy level within the PDB data structure, and selecting the appropriate type ensures that you retrieve relevant and accurate data.
#'
#' @param properties A named list of properties to fetch. Each name is a top-level attribute of the chosen data type, and each value is a character vector (or list) of the sub-fields to return for that attribute. For example, when fetching \code{POLYMER_ENTITY} data, you might request \code{"rcsb_entity_source_organism"} or \code{"rcsb_cluster_membership"}. The properties must belong to the data type being queried, and the argument must not be \code{NULL} or empty.
#' The \href{https://data.rcsb.org/#data-schema}{RCSB PDB Data Schema} provides detailed documentation on the properties available for each data type. Users are encouraged to refer to this resource when selecting properties for their queries.
#'
#' @param return_as_dataframe A logical value indicating whether to return the response as a dataframe. If \code{TRUE}, the data is returned in a structured dataframe format, which is convenient for analysis and manipulation in R. If \code{FALSE}, the data is returned in its original format, which may be a nested list or JSON-like structure. Default is \code{TRUE}.
#' @param verbosity A logical flag indicating whether to print status messages during function execution. When set to \code{TRUE}, the function reports its progress through \code{message()}. Default is \code{FALSE}.
#'
#' @return Depending on the value of \code{return_as_dataframe}, this function returns either a dataframe or the raw data in its original format. The dataframe format is particularly useful for further data analysis and visualization within R, while the raw format may be preferred for more complex or custom data processing tasks.
#'
#' @details The `data_fetcher` function is particularly useful for researchers who need to access and analyze specific subsets of PDB data. By providing a list of IDs and the corresponding data type, users can retrieve only the information relevant to their study, reducing the need to manually filter or process large datasets. The function also supports fetching multiple properties simultaneously, allowing for a more comprehensive data retrieval process.
#'
#'
#' @importFrom purrr flatten
#' @examples
#' # Example 1: Fetching basic entry information
#' properties <- list(cell = c("length_a", "length_b", "length_c"), exptl = c("method"))
#' data_fetcher(
#'   id = c("4HHB"),
#'   data_type = "ENTRY",
#'   properties = properties,
#'   return_as_dataframe = TRUE
#' )
#'
#' # Example 2: Fetching polymer entity data
#' properties <- list(
#'   rcsb_entity_source_organism = c("ncbi_taxonomy_id", "ncbi_scientific_name"),
#'   rcsb_cluster_membership = c("cluster_id", "identity")
#' )
#' data_fetcher(
#'   id = c("4HHB_1", "12CA_1"),
#'   data_type = "POLYMER_ENTITY",
#'   properties = properties,
#'   return_as_dataframe = TRUE
#' )
#'
#' # Example 3: Fetching non-polymer entity data
#' properties <- list(
#'   rcsb_nonpolymer_entity = c("details", "formula_weight", "pdbx_description"),
#'   rcsb_nonpolymer_entity_container_identifiers = c("chem_ref_def_id")
#' )
#' data_fetcher(
#'   id = c("3PQR_5", "3PQR_6"),
#'   data_type = "NONPOLYMER_ENTITY",
#'   properties = properties,
#'   return_as_dataframe = TRUE
#' )
#'
#' # Example 4: Fetching chemical component data
#' properties <- list(
#'   rcsb_id = list(),
#'   chem_comp = list("type", "formula_weight", "name", "formula"),
#'   rcsb_chem_comp_info = list("initial_release_date")
#' )
#' data_fetcher(
#'   id = c("NAG", "EBW"),
#'   data_type = "CHEMICAL_COMPONENT",
#'   properties = properties,
#'   return_as_dataframe = TRUE
#' )
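#'
#' # Example 5 (illustrative sketch): Returning the raw response instead of a dataframe
#' # This reuses the ENTRY properties from Example 1 and only changes the
#' # documented arguments return_as_dataframe and verbosity. The exact shape of
#' # the raw result is not specified here, so it is inspected with str().
#' properties <- list(cell = c("length_a", "length_b", "length_c"), exptl = c("method"))
#' raw_response <- data_fetcher(
#'   id = c("4HHB"),
#'   data_type = "ENTRY",
#'   properties = properties,
#'   return_as_dataframe = FALSE,
#'   verbosity = TRUE
#' )
#' str(raw_response, max.level = 2)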
#'
#' @export
data_fetcher <- function(id = NULL, data_type = "ENTRY", properties = NULL, return_as_dataframe = TRUE, verbosity = FALSE) {

  # Validate input parameters
  if (is.null(id) || length(id) == 0) {
    stop("Invalid input: 'id' must not be NULL or empty.")
  }

  valid_data_types <- c(
    "ENTRY", "POLYMER_ENTITY", "BRANCHED_ENTITY", "ASSEMBLY", "NONPOLYMER_ENTITY",
    "POLYMER_ENTITY_INSTANCE", "BRANCHED_ENTITY_INSTANCE", "NONPOLYMER_ENTITY_INSTANCE",
    "CHEMICAL_COMPONENT"
  )
  # 'data_type' must be a single string; a longer vector would break the if() condition
  if (!is.character(data_type) || length(data_type) != 1 || !data_type %in% valid_data_types) {
    stop(
      "Invalid 'data_type'. Please provide one of: ",
      paste0("'", valid_data_types, "'", collapse = ", "), "."
    )
  }

  if (is.null(properties) || length(properties) == 0) {
    stop("Invalid input: 'properties' must not be NULL or empty.")
  }

  # Add properties and generate query
  if (verbosity) {
    message("Adding properties and generating JSON query...")
  }

  json_query <- tryCatch(
    {
      generate_json_query(id, data_type, properties)
    },
    error = function(e) {
      stop("Failed to generate JSON query. Error: ", e$message)
    }
  )

  # Fetch data from the PDB
  response <- tryCatch(
    {
      fetch_data(json_query = json_query, data_type = data_type, ids = id)
    },
    error = function(e) {
      stop("Failed to fetch data from PDB. Error: ", e$message)
    }
  )

  if (is.null(response)) {
    warning("No response was received for the provided query.")
    return(NULL)
  }

  if (verbosity) {
    message("Data fetched successfully. Preparing response...")
  }

  # Optionally return as a dataframe
  if (return_as_dataframe) {
    response <- tryCatch(
      {
        return_data_as_dataframe(response, data_type, id)
      },
      error = function(e) {
        stop("Failed to convert response to dataframe. Error: ", e$message)
      }
    )
  }

  return(response)
}
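
The documentation above says that the function generates a query from the IDs and the `properties` list, but the helper that does this (`generate_json_query`) is not shown in this file. The following is a minimal sketch of how a named list of properties could be turned into a GraphQL-style query string for ENTRY data. The helper name `sketch_entry_query` is hypothetical, and the `entries(entry_ids: ...)` query shape is an assumption about the RCSB Data API rather than something confirmed by this source; the package's own implementation may differ.

```r
# Hypothetical helper, NOT part of rPDBapi: builds a GraphQL-style query
# string for ENTRY data from a named list of properties.
sketch_entry_query <- function(ids, properties) {
  stopifnot(is.character(ids), length(ids) > 0)
  stopifnot(is.list(properties), !is.null(names(properties)), all(names(properties) != ""))

  # Quote each ID and join them with commas
  id_list <- paste0('"', ids, '"', collapse = ", ")

  # One selection per property group; a group without sub-fields
  # (e.g. rcsb_id = list()) becomes a bare field name
  selections <- vapply(names(properties), function(group) {
    fields <- unlist(properties[[group]])
    if (length(fields) == 0) {
      group
    } else {
      paste0(group, " {", paste(fields, collapse = ", "), "}")
    }
  }, character(1))

  # Assumed query shape: {entries(entry_ids: [...]) {<selections>}}
  paste0("{entries(entry_ids: [", id_list, "]) {", paste(selections, collapse = ", "), "}}")
}

query <- sketch_entry_query(
  ids = "4HHB",
  properties = list(cell = c("length_a", "length_b", "length_c"), exptl = c("method"))
)
cat(query, "\n")
```

Each name in `properties` becomes a selection group and each value supplies the sub-fields inside it, which is why the names must match attributes of the data type being queried.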

rPDBapi documentation built on Sept. 11, 2024, 6:37 p.m.