View source: R/parallel_create_structure_contact_map.R

parallel_create_structure_contact_map    R Documentation

Creates a contact map of all atoms from a structure file (using parallel processing)

Description

This function is a wrapper around create_structure_contact_map() that can use all system cores to create contact maps. It can also be used for sequential processing of large datasets. Its advantage over create_structure_contact_map() is that it processes contact maps in batches, which is recommended for large datasets. Parallel processing should only be used on systems with enough available memory. Workers can either be set up manually before running the function with future::plan(multisession), or automatically by the function, in which case the maximum number of workers is 12. If workers are set up manually, the processing_type argument should be set to "parallel manual"; the workers can then be terminated after completion with future::plan(sequential).
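The manual worker setup described above can be sketched as follows. This is a minimal illustration, assuming the future package is installed; the worker count of 2 and the single-row input are arbitrary choices, and the protti call is left commented out because it needs the protti package and network access.

```r
# Sketch: manual worker setup for processing_type = "parallel manual".
# Assumption: the 'future' package is available; workers = 2 is arbitrary.
if (requireNamespace("future", quietly = TRUE)) {
  # Start background R sessions before calling the function
  future::plan(future::multisession, workers = 2)

  # Not run here (requires protti and network access):
  # contact_maps <- protti::parallel_create_structure_contact_map(
  #   data = data.frame(pdb_id = "6NPF"),
  #   id = pdb_id,
  #   processing_type = "parallel manual"
  # )

  # Terminate the workers once processing has finished
  future::plan(future::sequential)
}
```

Switching back to future::plan(sequential) at the end shuts down the background sessions so that they do not keep holding memory.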

Usage

parallel_create_structure_contact_map(
  data,
  data2 = NULL,
  id,
  chain = NULL,
  auth_seq_id = NULL,
  distance_cutoff = 10,
  pdb_model_number_selection = c(0, 1),
  return_min_residue_distance = TRUE,
  export = FALSE,
  export_location = NULL,
  split_n = 40,
  processing_type = "parallel"
)

Arguments

data

a data frame containing at least a column with PDB ID information, the name of which is provided to the id argument. If only this column is provided, all atom or residue distances are calculated. The data frame can additionally contain a chain column, the name of which is provided to the chain argument. If chains are provided, only distances of these chains relative to the rest of the structure are calculated. Multiple chains can be provided in multiple rows. If chains are provided for one structure but not for another, the rows without chain information should contain NAs. To reduce the selection further, specific residue positions can be provided in the auth_seq_id column. Creating full contact maps for more than a few structures is not recommended due to time and memory limitations. If contact maps are created only for small regions, multiple maps can be created at once. By default, distances of the regions provided in this data frame to the complete structure are computed. If distances of these regions to another specific subset of regions should be computed instead, the second subset can be provided through the optional data2 argument.

data2

optional, a data frame that contains a subset of regions for which distances to regions provided in the data data frame should be computed. If regions from the data data frame should be compared to the whole structure, data2 does not need to be provided. This data frame should have the same structure and column names as the data data frame.

id

a character column in the data data frame that contains PDB or UniProt IDs for structures or AlphaFold predictions of which contact maps should be created. If a structure not downloaded directly from PDB is provided (i.e. a locally stored structure file) to the structure_file argument, this column should contain "my_structure" as content.

chain

optional, a character column in the data data frame that contains chain identifiers for the structure file. Identifiers defined by the structure author should be used. Distances are calculated only between the provided chains and the rest of the structure.

auth_seq_id

optional, a character (or numeric) column in the data data frame that contains semicolon separated positions of regions for which distances should be calculated. This always needs to be provided in combination with a corresponding chain in chain. The position should match the positioning defined by the structure author. For PDB structures this information can be obtained from the find_peptide_in_structure function. The corresponding column in the output is called auth_seq_id. If an AlphaFold prediction is provided, UniProt positions should be used. If single positions and not stretches of amino acids are provided, the column can be numeric and does not need to contain the semicolon separator.
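The semicolon-separated format can be built from a numeric range with base R. The sketch below is only an illustration of the expected input format; the chain "A" and positions 10 to 15 are made-up values, and the splitting step is not a description of how protti parses the column internally.

```r
# Build the semicolon-separated position string for a stretch of residues.
# Assumption: chain "A" and positions 10 to 15 are illustration values only.
positions <- 10:15
region <- data.frame(
  pdb_id = "6NPF",
  chain = "A",
  auth_seq_id = paste(positions, collapse = ";"),
  stringsAsFactors = FALSE
)
region$auth_seq_id

# Splitting the string again recovers the individual positions
recovered <- as.integer(strsplit(region$auth_seq_id, ";", fixed = TRUE)[[1]])
recovered
```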

distance_cutoff

a numeric value specifying the distance cutoff in Angstrom. All values for pairwise comparisons are calculated but only values smaller than this cutoff will be returned in the output. If a cutoff of e.g. 5 is selected then only residues with a distance of 5 Angstrom and less are returned. Using a small value can reduce the size of the contact map drastically and is therefore recommended. The default value is 10.

pdb_model_number_selection

a numeric vector specifying which models from the structure files should be considered for contact maps. For example, NMR structures often contain many models in one file. The default for this argument is c(0, 1), which selects the first model of each structure file for contact map calculations. For AlphaFold predictions the model number is 0 (only .pdb files), so this case is also covered by the default.

return_min_residue_distance

a logical value that specifies if the contact map should be returned for all atom distances or the minimum residue distances. Minimum residue distances are smaller in size. If atom distances are not strictly needed it is recommended to set this argument to TRUE. The default is TRUE.
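To make the difference between atom distances and minimum residue distances concrete, the following base R sketch collapses a toy table of atom-level distances to one minimum per residue pair and then applies a cutoff. The column names, the distance values and the use of aggregate() are assumptions made for illustration; this is not protti's internal code, and the toy values avoid the cutoff boundary on purpose.

```r
# Toy atom-level distances between residues (all values are invented).
atom_distances <- data.frame(
  residue_a = c(1, 1, 1, 2),
  residue_b = c(5, 5, 6, 5),
  distance = c(4.2, 3.1, 9.8, 12.5)
)

# Keep only the smallest atom distance for each residue pair
min_residue_distance <- aggregate(
  distance ~ residue_a + residue_b,
  data = atom_distances,
  FUN = min
)

# Apply a distance cutoff of 10 Angstrom to the reduced table
distance_cutoff <- 10
within_cutoff <- min_residue_distance[
  min_residue_distance$distance < distance_cutoff,
]
within_cutoff
```

The reduced table has one row per residue pair instead of one row per atom pair, which is why minimum residue distances take up less space.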

export

a logical value that indicates if contact maps should be exported as ".csv". The name of the file will be the structure ID. Default is export = FALSE.

export_location

optional, a character value that specifies the path to the location in which the contact maps should be saved if export = TRUE. If left empty, the contact maps are saved in the current working directory. The location should be provided in the following format: "folderA/folderB".

split_n

a numeric value that specifies the number of structures that should be included in each batch. Default is 40.
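The effect of split_n can be pictured as grouping the structure IDs into consecutive batches. The base R sketch below only illustrates that idea; the ID list and split_n = 2 are arbitrary, and the actual batching inside the function may be implemented differently.

```r
# Illustration of batching: group structure IDs into batches of split_n.
# Assumption: this is a base R sketch, not protti's internal code.
ids <- c("6NPF", "1C14", "3NIR", "1AKE", "2LYZ")
split_n <- 2

# Assign each ID a batch number, then split the IDs by that number
batch_index <- ceiling(seq_along(ids) / split_n)
batches <- split(ids, batch_index)

length(batches)
batches
```

With five IDs and split_n = 2, the last batch holds the single remaining ID.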

processing_type

a character value that is either "parallel" for parallel processing or "sequential" for sequential processing. Alternatively, it can be "parallel manual", in which case you have to set up the workers yourself with future::plan(multisession) before running the function. The default is "parallel".

Value

A list of contact maps for each PDB or UniProt ID provided in the input is returned. If the export argument is TRUE, each contact map will be saved as a ".csv" file in the current working directory or the location provided to the export_location argument.

Examples

## Not run: 
# Create example data
data <- data.frame(
  pdb_id = c("6NPF", "1C14", "3NIR"),
  chain = c("A", "A", NA),
  auth_seq_id = c("1;2;3;4;5;6;7", NA, NA)
)

# Create contact map
contact_maps <- parallel_create_structure_contact_map(
  data = data,
  id = pdb_id,
  chain = chain,
  auth_seq_id = auth_seq_id,
  split_n = 1
)

str(contact_maps[["3NIR"]])

contact_maps

## End(Not run)

protti documentation built on Oct. 22, 2024, 1:06 a.m.