Nothing
#' Creates a contact map of all atoms from a structure file (using parallel processing)
#'
#' This function is a wrapper around `create_structure_contact_map()` that allows the use of all
#' system cores for the creation of contact maps. Alternatively, it can be used for sequential
#' processing of large datasets. The benefit of this function over `create_structure_contact_map()`
#' is that it processes contact maps in batches, which is recommended for large datasets. If used
#' for parallel processing it should only be used on systems that have enough memory available.
#' Workers can either be set up manually before running the function with
#' `future::plan(multisession)` or automatically by the function (maximum number of workers
#' is 12 in this case). If workers are set up manually the `processing_type` argument should
#' be set to "parallel manual". In this case workers can be terminated after completion with
#' `future::plan(sequential)`.
#'
#' @param data a data frame containing at least a column with PDB ID information of which the name
#' can be provided to the \code{id} argument. If only this column is provided, all atom or residue
#' distances are calculated. Additionally, a chain column can be present in the data frame of which
#' the name can be provided to the \code{chain} argument. If chains are provided, only distances
#' of this chain relative to the rest of the structure are calculated. Multiple chains can be
#' provided in multiple rows. If chains are provided for one structure but not for another, the
#' rows should contain NAs. Furthermore, specific residue positions can be provided in the \code{auth_seq_id}
#' column if the selection should be further reduced. It is not recommended to create full
#' contact maps for more than a few structures due to time and memory limitations. If contact maps are
#' created only for small regions it is possible to create multiple maps at once. By default distances
#' of regions provided in this data frame to the complete structure are computed. If distances of regions
#' from this data frame to another specific subset of regions should be computed, the second subset
#' of regions can be provided through the optional \code{data2} argument.
#' @param data2 optional, a data frame that contains a subset of regions for which distances to regions
#' provided in the \code{data} data frame should be computed. If regions from the \code{data} data
#' frame should be compared to the whole structure, data2 does not need to be provided.
#' This data frame should have the same structure and column names as the \code{data} data frame.
#' @param id a character column in the \code{data} data frame that contains PDB or UniProt IDs for
#' structures or AlphaFold predictions of which contact maps should be created. If a structure not
#' downloaded directly from PDB is provided (i.e. a locally stored structure file) to the
#' \code{structure_file} argument, this column should contain "my_structure" as content.
#' @param chain optional, a character column in the \code{data} data frame that contains chain
#' identifiers for the structure file. Identifiers defined by the structure author should be used.
#' Distances will be only calculated between the provided chains and the rest of the structure.
#' @param auth_seq_id optional, a character (or numeric) column in the \code{data} data frame
#' that contains semicolon separated positions of regions for which distances should be calculated.
#' This always needs to be provided in combination with a corresponding chain in \code{chain}.
#' The position should match the positioning defined by the structure author. For
#' PDB structures this information can be obtained from the \code{find_peptide_in_structure}
#' function. The corresponding column in the output is called \code{auth_seq_id}. If an
#' AlphaFold prediction is provided, UniProt positions should be used. If single positions
#' and not stretches of amino acids are provided, the column can be numeric and does not need
#' to contain the semicolon separator.
#' @param distance_cutoff a numeric value specifying the distance cutoff in Angstrom. All values
#' for pairwise comparisons are calculated but only values smaller than this cutoff will be
#' returned in the output. If a cutoff of e.g. 5 is selected then only residues with a distance of
#' 5 Angstrom and less are returned. Using a small value can reduce the size of the contact map
#' drastically and is therefore recommended. The default value is 10.
#' @param pdb_model_number_selection a numeric vector specifying which models from the structure
#' files should be considered for contact maps. E.g. NMR models often have many models in one file.
#' The default for this argument is c(0, 1). This means the first model of each structure file is
#' selected for contact map calculations. For AlphaFold predictions the model number is 0
#' (only .pdb files), therefore this case is also included here.
#' @param return_min_residue_distance a logical value that specifies if the contact map should be
#' returned for all atom distances or the minimum residue distances. Minimum residue distances are
#' smaller in size. If atom distances are not strictly needed it is recommended to set this
#' argument to TRUE. The default is TRUE.
#' @param export a logical value that indicates if contact maps should be exported as ".csv". The
#' name of the file will be the structure ID. Default is \code{export = FALSE}.
#' @param export_location optional, a character value that specifies the path to the location in
#' which the contact map should be saved if \code{export = TRUE}. If left empty, they will be
#' saved in the current working directory. The location should be provided in the following format
#' "folderA/folderB".
#' @param split_n a numeric value that specifies the number of structures that should be included
#' in each batch. Default is 40.
#' @param processing_type a character value that is either "parallel" for parallel processing or
#' "sequential" for sequential processing. Alternatively it can also be "parallel manual" in this
#' case you have to set up the number of cores on your own using the `future::plan(multisession)`
#' function. The default is "parallel".
#'
#' @return A list of contact maps for each PDB or UniProt ID provided in the input is returned.
#' If the \code{export} argument is TRUE, each contact map will be saved as a ".csv" file in the
#' current working directory or the location provided to the \code{export_location} argument.
#' @import dplyr
#' @importFrom purrr map2 flatten
#' @importFrom readr write_csv
#' @importFrom magrittr %>%
#' @importFrom rlang .data as_name enquo
#' @export
#'
#' @examples
#' \dontrun{
#' # Create example data
#' data <- data.frame(
#' pdb_id = c("6NPF", "1C14", "3NIR"),
#' chain = c("A", "A", NA),
#' auth_seq_id = c("1;2;3;4;5;6;7", NA, NA)
#' )
#'
#' # Create contact map
#' contact_maps <- parallel_create_structure_contact_map(
#' data = data,
#' id = pdb_id,
#' chain = chain,
#' auth_seq_id = auth_seq_id,
#' split_n = 1,
#' )
#'
#' str(contact_maps[["3NIR"]])
#'
#' contact_maps
#' }
parallel_create_structure_contact_map <- function(data,
data2 = NULL,
id,
chain = NULL,
auth_seq_id = NULL,
distance_cutoff = 10,
pdb_model_number_selection = c(0, 1),
return_min_residue_distance = TRUE,
export = FALSE,
export_location = NULL,
split_n = 40,
processing_type = "parallel") {
. <- NULL
dependency_test <- c(
furrr = !requireNamespace("furrr", quietly = TRUE),
future = !requireNamespace("future", quietly = TRUE),
parallel = !requireNamespace("parallel", quietly = TRUE),
drc = !requireNamespace("drc", quietly = TRUE)
)
if (any(dependency_test)) {
dependency_name <- names(dependency_test[dependency_test == TRUE])
if (length(dependency_name) == 1) {
message("Package \"",
paste(dependency_name),
"\" is needed for this function to work. Please install it.",
call. = FALSE
)
return(invisible(NULL))
} else {
message("Packages \"",
paste(dependency_name, collapse = "\" and \""),
"\" are needed for this function to work. Please install them.",
call. = FALSE
)
return(invisible(NULL))
}
}
terminate <- FALSE
if (processing_type == "parallel") {
message("Setting up workers ... ", appendLF = FALSE)
terminate <- TRUE
n_cores <- parallel::detectCores()
n_cores <- ifelse(n_cores > 12, 12, n_cores)
oplan <- future::plan(future::multisession, workers = n_cores)
on.exit(future::plan(oplan), add = TRUE)
message("DONE", appendLF = TRUE)
}
# if data2 was provided make sure that only IDs that overlap are retained
if (!missing(data2)) {
data <- data %>%
dplyr::filter({{ id }} %in% unique(dplyr::pull(data2, {{ id }})))
data2 <- data2 %>%
dplyr::filter({{ id }} %in% unique(dplyr::pull(data, {{ id }})))
}
unique_ids <- unique(dplyr::pull(data, {{ id }}))
pieces <- rep(1:ceiling(length(unique_ids) / split_n), split_n)[1:length(unique_ids)]
pieces_mapping <- tibble::tibble({{ id }} := unique(dplyr::pull(data, {{ id }})), piece = pieces)
input <- data %>%
dplyr::left_join(pieces_mapping, by = rlang::as_name(rlang::enquo(id))) %>%
split(dplyr::pull(., .data$piece))
if (!missing(data2)) {
input2 <- data2 %>%
dplyr::left_join(pieces_mapping, by = rlang::as_name(rlang::enquo(id))) %>%
split(dplyr::pull(., .data$piece))
}
message("Performing contact map calculations ... ", appendLF = FALSE)
if (missing(data2)) {
result <- furrr::future_map(
.x = input,
.f = ~ protti::create_structure_contact_map(
data = .x,
id = {{ id }},
chain = {{ chain }},
auth_seq_id = {{ auth_seq_id }},
distance_cutoff = distance_cutoff,
pdb_model_number_selection = pdb_model_number_selection,
return_min_residue_distance = return_min_residue_distance,
show_progress = FALSE
),
.options = furrr::furrr_options(globals = FALSE)
)
message("DONE", appendLF = TRUE)
}
if (!missing(data2)) {
result <- furrr::future_map2(
.x = input,
.y = input2,
.f = ~ protti::create_structure_contact_map(
data = .x,
data2 = .y,
id = {{ id }},
chain = {{ chain }},
auth_seq_id = {{ auth_seq_id }},
distance_cutoff = distance_cutoff,
pdb_model_number_selection = pdb_model_number_selection,
return_min_residue_distance = return_min_residue_distance,
show_progress = FALSE
),
.options = furrr::furrr_options(globals = FALSE)
)
message("DONE", appendLF = TRUE)
}
if (terminate == TRUE) {
future::plan(future::sequential)
}
result <- result %>%
purrr::flatten()
if (export == FALSE) {
return(result)
} else {
message("Exporting contact maps ... ", appendLF = FALSE)
# make sure export location is correct if or if not provided.
if (missing(export_location)) {
export_location <- ""
} else {
export_location <- paste0(export_location, "/")
}
purrr::map2(
.x = result,
.y = names(result),
.f = ~ {
readr::write_csv(x = .x, file = paste0(export_location, .y, "_contact_map.csv"), progress = FALSE)
}
)
message("DONE", appendLF = TRUE)
return(invisible(NULL))
}
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.