perform_search: Perform a Search in the RCSB PDB

perform_search R Documentation

Description

This function allows users to perform highly customizable searches in the RCSB Protein Data Bank (PDB) by specifying detailed search criteria. It interfaces directly with the RCSB PDB's RESTful API, enabling complex queries to retrieve specific data, such as PDB entries, assemblies, polymer entities, non-polymer entities, and more.

Usage

perform_search(
  search_operator,
  return_type = "ENTRY",
  request_options = NULL,
  return_with_scores = FALSE,
  return_raw_json_dict = FALSE,
  verbosity = TRUE
)

Arguments

search_operator

An object that specifies the search criteria. This object can be constructed using various operator functions:

DefaultOperator

A basic search operator for general search operations.

ExactMatchOperator

For exact match searches, matching an exact attribute value.

InOperator

For searching where the attribute value must be within a specified set of values.

ContainsWordsOperator

For searching attributes that contain certain words.

ContainsPhraseOperator

For searching attributes that contain a specific phrase.

ComparisonOperator

For comparison-based searches, such as finding values greater than, less than, or equal to a specified value.

RangeOperator

For searching within a range of values for a given attribute.

ExistsOperator

To check the existence of a specific attribute in the database.

StructureOperator

For structure-based searches, using PDB entry IDs, assembly IDs, and search modes.

SequenceOperator

For sequence-based searches, using sequences, sequence types, and cutoffs for e-value and identity.

SeqMotifOperator

For searching sequence motifs, using pattern types like SIMPLE, PROSITE, or REGEX.

ChemicalOperator

For chemical structure searches, using SMILES or InChI descriptors and various matching criteria.

See the Details section for pointers to the operator documentation and the searchable attributes.

return_type

A string specifying the type of data to return. The available options for return_type include:

"ENTRY"

Returns a list of PDB IDs corresponding to the entries that match the search criteria. This is the default option and provides entry-level information.

"ASSEMBLY"

Returns a list of PDB IDs appended with assembly IDs (formatted as "PDB_ID-ASSEMBLY_ID"). Useful for accessing specific biological assemblies.

"POLYMER_ENTITY"

Returns a list of PDB IDs appended with entity IDs for polymeric molecular entities. Useful for examining specific polymer chains.

"NON_POLYMER_ENTITY"

Returns a list of PDB IDs appended with entity IDs for non-polymeric entities, such as ligands or small molecules. Useful for detailed chemical analysis.

"POLYMER_INSTANCE"

Returns a list of PDB IDs appended with asym IDs, representing specific instances of polymeric entities (e.g., protein chains).

"CHEMICAL_COMPONENT"

Returns a list of chemical component identifiers, useful for detailed chemical analysis.

request_options

A list of additional options to further customize the search request. These options can include:

facets

Faceted queries allow aggregation of search results into categories (buckets) based on the requested field values. Useful for statistical analysis and data aggregation.

sort_by

Defines the sorting criteria for the search results (e.g., by resolution, release date).

pagination

Controls how many results to return per page and which page of results to return. Useful for handling large datasets.

return_all_hits

If set to TRUE, the search returns all matching results; otherwise, a limited set is returned.
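
As a minimal sketch, a request_options list can be assembled from plain R lists. The facets entry copies the shape used in Example 2 on this page, and return_all_hits is the flag described above. The internal layout of sort_by and pagination is not specified on this page, so they are left out rather than guessed.

```r
# Facet definition in the same shape as Example 2:
# bucket matching entries by experimental method.
method_facet <- list(
  name = "Methods",
  aggregation_type = "terms",
  attribute = "exptl.method"
)

request_options <- list(
  facets = list(method_facet),  # a list of facet definitions
  return_all_hits = TRUE        # ask for every match, not a limited set
)

# Inspect the nested structure before passing it to perform_search()
str(request_options)
```

Building the facet as a named object first makes it easy to add further facets to the same list later.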

return_with_scores

Logical; if TRUE, the search results will include relevance scores. Useful when prioritizing results based on their relevance to the search criteria. Default is FALSE.

return_raw_json_dict

Logical; if TRUE, the function returns the raw JSON response from the PDB API. This option is valuable for advanced users who wish to process the raw data themselves or need access to additional details. Default is FALSE.

verbosity

Logical; if TRUE, detailed messages describing the query being sent and the response received are displayed during execution, which is useful for debugging. Default is TRUE.

Details

The operators let you build complex search queries tailored to your needs. Detailed documentation for each operator can be found in the RCSB PDB Search Operators documentation. Searchable attributes include annotations from the mmCIF dictionary, annotations from external resources, and annotations added by RCSB PDB. Internal additions to the mmCIF dictionary and external resource annotations are both prefixed with 'rcsb_'. For a complete list of attributes available for text searches, refer to the Structure Attributes Search and Chemical Attributes Search pages.
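
Example 3 on this page combines several operators by hand into a group node. A small helper can cut that repetition. The sketch below uses only base R: make_terminal() and make_group() are hypothetical helpers, not part of rPDBapi, and the plain lists are placeholders for the operator objects that ExactMatchOperator() and the other constructors would normally supply.

```r
# Hypothetical helper: wrap one operator in a terminal node,
# mirroring the node layout used in Example 3.
make_terminal <- function(parameters, service = "text") {
  list(type = "terminal", service = service, parameters = parameters)
}

# Hypothetical helper: combine terminal nodes under one logical operator.
make_group <- function(..., logical_operator = "and") {
  list(
    type = "group",
    logical_operator = logical_operator,
    nodes = list(...)
  )
}

# Placeholder lists standing in for real operator objects.
op_symbol <- list(attribute = "rcsb_struct_symmetry.symbol", value = "C2")
op_kind   <- list(attribute = "rcsb_struct_symmetry.kind", value = "Global Symmetry")

query_group <- make_group(
  make_terminal(op_symbol),
  make_terminal(op_kind)
)

# Show the top two levels of the assembled query
str(query_group, max.level = 2)
```

The resulting list has the same type, logical_operator, and nodes fields as the hand-written query_group in Example 3, so it could be passed as search_operator in the same way.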

Value

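The function returns search results whose form depends on the specified return_type. If return_with_scores is TRUE, the results also include relevance scores; if return_raw_json_dict is TRUE, the raw JSON response is returned instead.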

ENTRY

A vector of PDB IDs that match the search criteria.

ASSEMBLY

A list of PDB IDs with appended assembly IDs, formatted as "PDB_ID-ASSEMBLY_ID".

POLYMER_ENTITY

A list of PDB IDs with appended entity IDs for polymeric chains.

NON_POLYMER_ENTITY

A list of PDB IDs with appended entity IDs for non-polymeric components.

POLYMER_INSTANCE

A list of PDB IDs with appended asym IDs for specific polymer instances.

CHEMICAL_COMPONENT

A list of chemical component identifiers.
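
Identifiers in the "PDB_ID-ASSEMBLY_ID" format can be split back into their parts with base R. This sketch assumes the results arrive as a character vector, or as a list that unlist() can flatten. The identifiers below are illustrative values, not output from a real search.

```r
# Illustrative identifiers in the "PDB_ID-ASSEMBLY_ID" format
assembly_ids <- c("4HHB-1", "1TUP-2", "2HHB-1")

# Split each identifier at the hyphen
parts <- strsplit(unlist(assembly_ids), "-", fixed = TRUE)

# Collect the two halves into a data frame
assemblies <- data.frame(
  pdb_id      = vapply(parts, `[`, character(1), 1),
  assembly_id = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

# Entries that contributed at least one assembly
unique(assemblies$pdb_id)
```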

Examples


# Example 1: Search for Polymer Entities from Mus musculus and Homo sapiens
search_operator <- InOperator(
  attribute = "rcsb_entity_source_organism.taxonomy_lineage.name",
  value = c("Mus musculus", "Homo sapiens")
)
results <- perform_search(
  search_operator = search_operator,
  return_type = "POLYMER_ENTITY"
)
results

# Example 2: Search for Entries Released After a Specific Date,
# requesting a facet that groups results by experimental method
operator_date <- ComparisonOperator(
  attribute = "rcsb_accession_info.initial_release_date",
  value = "2019-08-20",
  comparison_type = "GREATER"
)
request_options <- list(
  facets = list(
    list(
      name = "Methods",
      aggregation_type = "terms",
      attribute = "exptl.method"
    )
  )
)
results <- perform_search(
  search_operator = operator_date,
  return_type = "ENTRY",
  request_options = request_options
)
results

# Example 3: Search for Symmetric Dimers with DNA-Binding Domain
operator_symbol <- ExactMatchOperator(
  attribute = "rcsb_struct_symmetry.symbol",
  value = "C2"
)
operator_kind <- ExactMatchOperator(
  attribute = "rcsb_struct_symmetry.kind",
  value = "Global Symmetry"
)
operator_full_text <- DefaultOperator(
  value = "\"heat-shock transcription factor\""
)
operator_dna_count <- ComparisonOperator(
  attribute = "rcsb_entry_info.polymer_entity_count_DNA",
  value = 1,
  comparison_type = "GREATER_OR_EQUAL"
)
query_group <- list(
  type = "group",
  logical_operator = "and",
  nodes = list(
    list(
      type = "terminal",
      service = "text",
      parameters = operator_symbol
    ),
    list(
      type = "terminal",
      service = "text",
      parameters = operator_kind
    ),
    list(
      type = "terminal",
      service = "full_text",
      parameters = operator_full_text
    ),
    list(
      type = "terminal",
      service = "text",
      parameters = operator_dna_count
    )
  )
)
results <- perform_search(
  search_operator = query_group,
  return_type = "ASSEMBLY"
)
results


rPDBapi documentation built on Sept. 11, 2024, 6:37 p.m.