
##
#' @title STlist: Creation of STlist objects for spatial transcriptomics analysis
#' @description Creates an STlist object from one or multiple spatial transcriptomic samples.
#' @details
#' Objects of the S4 class STlist are the starting point of analyses in **`spatialGE`**.
#' The STlist contains data from one or multiple samples (i.e., tissue slices), and
#' results from most `spatialGE`'s functions are stored within the object.
#' \itemize{
#' \item Raw gene counts and spatial coordinates. Gene count data have genes in rows and
#' sampling units (e.g., cells, spots) in columns. Spatial coordinates have
#' sampling units in rows and three columns: sample unit IDs, Y position, and X position.
#' \item Visium outputs from *Space Ranger*. The Visium directory must have the directory
#' structure resulting from `spaceranger count`, with either a count matrix represented in
#' MEX files or an h5 file. The directory should also contain a `spatial` sub-directory,
#' with the spatial coordinates (`tissue_positions_list.csv`), and
#' optionally the high resolution tissue image and scaling factor file `scalefactors_json.json`.
#' \item Xenium outputs from *Xenium Ranger*. The Xenium directory must have the directory
#' structure resulting from the `xeniumranger` pipeline, with either a cell-feature matrix
#' represented in MEX files or an h5 file. The directory should also contain a parquet file,
#' with the spatial coordinates (`cells.parquet`).
#' \item CosMx-SMI outputs. Two files are required to process SMI outputs: The `exprMat` and
#' `metadata` files. Both files must contain the "fov" and "cell_ID" columns. In addition,
#' the `metadata` files must contain the "CenterX_local_px" and "CenterY_local_px" columns.
#' \item Seurat object (V4). A Seurat V4 object produced via `Seurat::Load10X_Spatial`.
#' }
#' Optionally, the user can input a path to a file containing a table of sample-level
#' metadata (e.g., clinical outcomes, tissue type, age). This sample metadata file
#' should contain sample IDs in the first column partially matching the file names of
#' the count/coordinate file paths or Visium directories. _Note:_ The sample ID of a
#' given sample cannot be a substring of the sample ID of another sample. For example,
#' instead of using "tissue1" and "tissue12", use "tissue01" and "tissue12".
#'
#' The function uses parallelization when run on a Unix system. Windows users
#' will experience longer run times, depending on the number of samples.
#'
#' @param rnacounts the count data which can be provided in one of these formats:
#' \itemize{
#' \item File paths to comma- or tab-delimited files containing raw gene counts, one file
#' for each spatial sample. The first column contains gene names and subsequent columns
#' contain the expression data for each cell/spot. Duplicate gene names will be
#' modified using `make.unique`. Requires `spotcoords` and `samples`.
#' \item File paths to Visium output directories (one per spatial sample). Each directory
#' must follow the structure resulting from `spaceranger count` and contain either an
#' `.h5` count file or, if no `.h5` file is available, sparse matrices (MEX) from
#' `spaceranger count`. In the latter case, a sub-directory called
#' `filtered_feature_bc_matrix` should contain the `barcodes.tsv.gz`,
#' `features.tsv.gz`, and `matrix.mtx.gz` files. The `spatial` sub-directory minimally
#' contains the coordinates (`tissue_positions_list.csv`), and optionally the high
#' resolution PNG image and accompanying scaling factors (`scalefactors_json.json`).
#' Requires `samples`.
#' \item File paths to Xenium output directories (one per spatial sample). Each directory
#' must follow the structure resulting from the `xeniumranger` pipeline and contain
#' either an `.h5` count file or sparse matrices (MEX). In the latter case, a sub-directory
#' called `cell_feature_matrix` should contain the `barcodes.tsv.gz`,
#' `features.tsv.gz`, and `matrix.mtx.gz` files. The coordinates must be available
#' in the `cells.parquet` file. Requires `samples`.
#' \item The `exprMat` file for each slide of a CosMx-SMI output. The file must contain
#' the "fov" and "cell_ID" columns. The `STlist` function will separate data from each
#' FOV, since analysis in spatialGE is conducted at the FOV level. Requires `samples` and
#' `spotcoords`.
#' \item Seurat object (V4). A Seurat V4 object produced via `Seurat::Load10X_Spatial`.
#' Multiple samples are allowed as long as they are stored as "slices" in the Seurat object.
#' Does not require `samples`, as sample names are taken from `names(seurat_obj@images)`.
#' \item A named list of data frames with raw gene counts (one data frame per spatial
#' sample). Requires `spotcoords`. Argument `samples` only needed when a file path to
#' sample metadata is provided.
#' }
#' @param spotcoords the cell/spot coordinates. Not required if inputs are Visium or
#' Xenium (spaceranger or xeniumranger outputs).
#' \itemize{
#' \item File paths to comma- or tab-delimited files containing cell/spot coordinates, one
#' for each spatial sample. The files must contain three columns: cell/spot IDs, Y positions,
#' and X positions. The cell/spot IDs must match the column names of the cells/spots (columns)
#' in the gene count files. Requires `samples` and `rnacounts`.
#' \item The `metadata` file for each slide of a CosMx-SMI output. The file must contain
#' the "fov", "cell_ID", "CenterX_local_px", and "CenterY_local_px" columns. The `STlist`
#' function will separate data from each FOV, since analysis in spatialGE is conducted at
#' the FOV level. Requires `samples` and `rnacounts`.
#' \item A named list of data frames with cell/spot coordinates. The list names must
#' match the names of the gene counts list.
#' }
#' @param samples the sample names/IDs and (optionally) metadata associated with
#' each spatial sample.
#' The following options are available for `samples`:
#' \itemize{
#' \item A vector with sample names, which will be used to match the gene counts and
#' cell/spot coordinates file paths. A sample name must not match the file
#' paths of two different samples. For example, instead of using "tissue1" and
#' "tissue12", use "tissue01" and "tissue12".
#' \item A path to a file containing a table with metadata. This file is a comma- or
#' tab-separated table with one sample per row and sample names/IDs in the first
#' column. Subsequent columns may contain variables associated with each spatial sample.
#' }
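#'
#' For illustration, a hypothetical comma-delimited metadata file could look like this
#' (the column names and values below are made up; sample IDs in the first column
#' partially match the count/coordinate file names):
#' \preformatted{
#' sample_name,tissue_type,age
#' tissue01,tumor,65
#' tissue12,normal,71
#' }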
#' @param cores integer indicating the number of cores to use during parallelization.
#' If NULL, the function uses at most half of the available cores. The parallelization
#' uses `parallel::mclapply` and works only on Unix systems.
#' @return an STlist object containing the counts and coordinates, and optionally
#' the sample metadata, which can be used for downstream analysis with `spatialGE`
#'
#' @examples
#'
#' # Using included melanoma example (Thrane et al.)
#' # Download example data set from spatialGE_Data
#' thrane_tmp = tempdir()
#' unlink(thrane_tmp, recursive=TRUE)
#' dir.create(thrane_tmp)
#' lk='https://github.com/FridleyLab/spatialGE_Data/raw/refs/heads/main/melanoma_thrane.zip?download='
#' download.file(lk, destfile=paste0(thrane_tmp, '/', 'melanoma_thrane.zip'), mode='wb')
#' zip_tmp = list.files(thrane_tmp, pattern='melanoma_thrane.zip$', full.names=TRUE)
#' unzip(zipfile=zip_tmp, exdir=thrane_tmp)
#' # Generate the file paths to be passed to the STlist function
#' count_files <- list.files(paste0(thrane_tmp, '/melanoma_thrane'),
#'                           full.names=TRUE, pattern='counts')
#' coord_files <- list.files(paste0(thrane_tmp, '/melanoma_thrane'),
#'                           full.names=TRUE, pattern='mapping')
#' clin_file <- list.files(paste0(thrane_tmp, '/melanoma_thrane'),
#'                         full.names=TRUE, pattern='clinical')
#' # Create STlist
#' library('spatialGE')
#' melanoma <- STlist(rnacounts=count_files[c(1,3)],
#'                    spotcoords=coord_files[c(1,3)],
#'                    samples=clin_file) # Using only the first and third samples
#' melanoma
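#'
#' # A minimal sketch (not from a real data set) showing the named-list-of-data-frames
#' # input; the gene, spot, and sample names below are hypothetical, simulated values
#' counts_sim <- data.frame(gene=paste0('gene', 1:10),
#'                          matrix(rpois(50, lambda=5), nrow=10,
#'                                 dimnames=list(NULL, paste0('spot', 1:5))))
#' coords_sim <- data.frame(libname=paste0('spot', 1:5),
#'                          ypos=runif(5), xpos=runif(5))
#' sim_stlist <- STlist(rnacounts=list(sampleA=counts_sim),
#'                      spotcoords=list(sampleA=coords_sim))
#' sim_stlist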
#'
#' @export STlist
#'
#' @import Matrix
#' @importFrom magrittr %>%
#
STlist = function(rnacounts=NULL, spotcoords=NULL, samples=NULL, cores=NULL){
  # Check input type.
  input_check = detect_input(rnacounts=rnacounts, spotcoords=spotcoords, samples=samples)
  input_rnacounts = detect_input_rnacounts(rnacounts)
  input_spotcoords = detect_input_spotcoords(spotcoords)
  input_samples = detect_input_samples(samples)

  # Define number of available cores to use.
  if(is.null(cores) && !is.null(rnacounts)){
    cores = count_cores(length(rnacounts))
  } else{
    cores = as.integer(cores)
    if(is.na(cores)){
      raise_err(err_code='error0032')
    }
  }

  # Output error if input_check is empty (likely input format not recognized).
  if(rlang::is_empty(input_rnacounts)){
    raise_err(err_code='error0033')
  }

  # Parse sample names
  if(input_rnacounts == 'list_dfs'){
    sample_names = names(rnacounts)
  } else if(input_samples == 'sample_names'){
    sample_names = samples
  } else if(input_rnacounts == 'seurat'){
    sample_names = names(rnacounts@images)
  } else {
    if(input_samples == 'samplesfile_excel'){
      samples_df = readxl::read_excel(samples)
    } else if(input_samples == 'samplesfile_tab'){
      samples_df = readr::read_delim(samples, name_repair='unique', delim='\t', show_col_types=F)
    } else if(input_samples == 'samplesfile_comma'){
      samples_df = readr::read_delim(samples, name_repair='unique', delim=',', show_col_types=F)
    }
    sample_names = samples_df[[1]]
  }

  # Test the validity of sample names
  if(length(sample_names) > 0){
    # Sample names SHOULD NOT begin with a number
    test_number = any(sapply(sample_names, function(i){grepl("^[0-9]", i)}))
    if(test_number){
      raise_err(err_code='error0028')
    }
    # Test that sample names contain only alpha-numerics, spaces, dash, and underscores
    test_chars = any(!sapply(sample_names, function(i){grepl("^[A-Za-z0-9 _-]+$", i)}))
    if(test_chars){
      raise_err(err_code='error0028')
    }
    rm(test_number, test_chars) # Clean env
  } else{
    raise_err(err_code='error0004')
  }


  #### CASES BEGIN HERE ########################################################


  # CASE: NAMED LIST OF DATAFRAMES WITH COUNTS AND NAMED LIST OF DATA FRAMES WITH COORDINATES.
  # METADATA INFO OPTIONAL.
  if(input_rnacounts == 'list_dfs' && input_spotcoords == 'list_dfs'){
    cat(paste("Found list of dataframes.\n"))
    pre_lists = read_list_dfs(rnacounts, spotcoords)
  }


  # CASE: SEURAT OBJECT
  if(input_rnacounts == 'seurat'){
    requireNamespace('SeuratObject')
    cat(paste("Found Seurat object.\n"))
    pre_lists = read_seurat(rnacounts)
    img_obj = pre_lists[['images']]
    platform = 'visium' # FUTURE DEVELOPMENT: CANT ASSUME IT'S VISIUM... MAYBE SHOULD USE INFO FROM SEURAT OBJECT TO IDENTIFY TECH
  }


  # CASE: SAMPLE FILE PLUS FILE PATH(S) TO COUNT COORDINATE MATRICES OR VISIUM/XENIUM DIRS
  visium = c('visium_filtered_h5', 'visium_raw_h5', 'visium_filtered_mex', 'visium_raw_mex')
  xenium = c('xenium_h5', 'xenium_mex')
  cosmx = c('tab_delim_cosmx', 'comma_delim_cosmx')
  delim = c('tab_delim', 'comma_delim')
  if(input_rnacounts %in% c(visium, xenium, cosmx, delim)){
    # Get list of filepaths
    filepaths = process_sample_names(rnacounts, spotcoords, sample_names, input_rnacounts, input_spotcoords)

    # Check if input is Visium, Xenium, CosMx, or count+coord matrices
    if(input_rnacounts %in% visium){
      cat(paste("Found Visium data\n"))
      pre_lists = read_visium_outs(filepaths, input_rnacounts, cores=cores)
      img_obj = pre_lists[['images']]
      image_scale = pre_lists[['json_scale']]
      platform = 'visium'
    } else if(input_rnacounts %in% xenium){
      cat(paste("Found Xenium data\n"))
      pre_lists = read_xenium_outs(filepaths, input_rnacounts, cores=cores)
      platform = 'xenium'
    } else if(input_check$rna[1] == 'cosmx'){
      cat(paste("Found CosMx-SMI data\n"))
      pre_lists = read_cosmx_input(filepaths, input_check, cores=cores)
      img_obj = pre_lists[['images']]
      platform = 'cosmx'
    } else{
      cat(paste("Found matrix data\n"))
      pre_lists = read_matrices_fps(filepaths, input_check, cores=cores)
    }
  }


  #### CASES FINISH HERE ########################################################


  # No input provided
  if(is.null(input_check$rna) && is.null(input_check$coords) && is.null(input_check$samples)){
    stop('No input provided. Please refer to documentation.')
  }

  # Process count and coordinate lists before placing within STlist
  cat(paste("Matching gene expression and coordinate data...\n"))
  procLists = process_lists(counts_df_list=pre_lists[['counts']], coords_df_list=pre_lists[['coords']])

  # Process metadata if provided or make an empty tibble
  samples_df = tibble::tibble()
  if(input_check$samples[1] == 'samplesfile' || input_check$samples[1] %in% c('samplesfile_visium_h5', 'samplesfile_visium_mex') || input_check$samples[1] == 'samplesfile_matrices'){
    samples_df = process_meta(samples=samples, input_check=input_check, counts_df_list=procLists[['counts']])
  #}else if(input_check$samples[1] == 'samplesfile_geomx'){
  #  samples_df = process_meta_geomx(samples, input_check, procLists[['counts']], gmx_slide_col, gmx_meta_cols)
  }else{
    samples_df = tibble::tibble(sample_name=names(procLists[['counts']]))
  }
  # Make sure sample IDs are character
  samples_df[[1]] = as.character(samples_df[[1]])

  if(!is.null(input_check$rna[1])){
    if(!(input_check$rna[1] %in% c('visium_out_h5', 'visium_out_mex'))){
      cat(paste("Converting counts to sparse matrices\n"))
      procLists[['counts']] = parallel::mclapply(procLists[['counts']], function(x){
        makeSparse(x)
      }, mc.cores=cores)
    }
  } else if(!(input_check$samples[1] %in% c('samplesfile_visium_h5', 'samplesfile_visium_mex'))){
    cat(paste("Converting counts to sparse matrices\n"))
    procLists[['counts']] = parallel::mclapply(procLists[['counts']], function(x){
      makeSparse(x)
    }, mc.cores=cores)
  }

  # Detect if image from Visium out is available
  if(!exists('img_obj')){
    img_obj = NULL
  }

  # Detect if image scale info is available
  if(!exists('image_scale')){
    image_scale = NULL
  }

  # If no specific platform was found, then make generic
  if(!exists('platform')){
    platform = 'generic'
  }

  # Creates STlist object from processed data
  STlist_obj = new("STlist",
                   counts=procLists[['counts']],
                   spatial_meta=procLists[['coords']],
                   gene_meta=list(),
                   sample_meta=samples_df,
                   tr_counts=list(),
                   #gene_het=list(),
                   gene_krige=list(),
                   #cell_deconv=list(),
                   misc=list(platform=platform,
                             sp_images=img_obj,
                             image_scaling=image_scale,
                             sthet=list())
  )
  cat("Completed STlist!\n")
  return(STlist_obj)
}


# Helpers ----------------------------------------------------------------------

##
# makeSparse: takes a data frame and converts it to a sparse matrix to save space;
# wraps functionality from the Matrix package
# @param dataframe, a data frame or tibble with gene names in the first column and
# different spots in the following columns
# @return a sparse matrix
#
makeSparse = function(dataframe){

  # To prevent NOTES in R CMD check
  . = NULL

  if(any(class(dataframe) == 'dgCMatrix')){
    numdat = dataframe
  } else if(!is.matrix(dataframe)){
    # Force data frame if tibble (tibbles do not support rownames)
    if(tibble::is_tibble(dataframe)){
      dataframe = as.data.frame(dataframe)
    }
    genecol = colnames(dataframe)[1]

    numdat = dataframe
    rownames(numdat) = dataframe[[genecol]]
    numdat = numdat[, -which(colnames(numdat) == genecol)]
    numdat = as.matrix(numdat)
    numdat = as(numdat, "sparseMatrix")
  } else {
    numdat = dataframe %>% as(., "sparseMatrix")
  }
  return(numdat)
}

##
# expandSparse: takes a sparse matrix and returns a data frame
# @param sparsedMatrix, a sparse matrix created with the Matrix package
# @return a data frame
#
expandSparse = function(sparsedMatrix){
  NonSparse = data.frame(as.matrix(sparsedMatrix), check.names=F)
  return(NonSparse)
}
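
##
# Example (not run): a minimal sketch of the makeSparse/expandSparse round trip,
# assuming a toy data frame with gene names in the first column (hypothetical values)
# toy_df = data.frame(gene=c('g1', 'g2'), spot1=c(0, 3), spot2=c(1, 0))
# toy_sp = makeSparse(toy_df)   # 2 x 2 dgCMatrix; gene names become rownames
# expandSparse(toy_sp)          # back to a plain data frame (genes as rownames)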

##
# read_list_dfs: Takes two named lists of counts and coordinates and performs sorting
# to get count and coordinate data frames in the same order
# @param rnacounts, a named list with counts
# @param spotcoords, a named list with coordinates
# @return return_lists a list with two lists within (one with counts, one with coordinates)
#
read_list_dfs = function(rnacounts, spotcoords){
  # Check that RNA and coord lists have the same number of elements.
  if(length(rnacounts) != length(spotcoords)){
    stop('The RNA count and coordinate lists do not have the same number of elements.')
  }
  # Test that names within the lists are the same.
  if(length(setdiff(names(rnacounts), names(spotcoords))) != 0){
    stop('The RNA count and coordinate lists do not have the same names.')
  }
  # Sort lists' names to order both.
  sorted_names = sort(names(rnacounts))

  # Process duplicate gene names
  # NOTE: Duplicate genes coming from matrices are modified with `make.unique`, but
  # are not comparable across matrices because unique identifiers for the probes are not used.
  for(i in 1:length(rnacounts)){
    if(any(duplicated(rnacounts[[i]][[1]]))){
      rnacounts[[i]][[1]] = make.unique(rnacounts[[i]][[1]])
    }
  }

  # Create list to be returned
  return_lists = list()
  return_lists[['counts']] = rnacounts[sorted_names]
  return_lists[['coords']] = spotcoords[sorted_names]
  return(return_lists)
}

##
# read_seurat: Takes a Seurat object and converts it to a STlist.
# @param rnacounts, a seurat object with spatial slot
# @return return_lists a list with two lists within (one with counts, one with coordinates)
#
read_seurat = function(rnacounts){
  # Create list to be returned
  return_lists = list()
  return_lists[['counts']] = list()
  return_lists[['coords']] = list()
  # Find assays within Seurat object
  slices = names(rnacounts@images)
  counts_mtx = as.data.frame(rnacounts@assays$Spatial@counts)
  # Extract counts and coordinates from each assay
  for(i in slices){
    spots = rownames(rnacounts@images[[i]]@coordinates)
    return_lists[['coords']][[i]] = rnacounts@images[[i]]@coordinates[, c('imagerow', 'imagecol')]
    return_lists[['coords']][[i]] = tibble::rownames_to_column(return_lists[['coords']][[i]], var='libname')
    return_lists[['coords']][[i]] = tibble::as_tibble(return_lists[['coords']][[i]])

    return_lists[['counts']][[i]] = counts_mtx[, colnames(counts_mtx) %in% spots]
    return_lists[['counts']][[i]] = tibble::rownames_to_column(return_lists[['counts']][[i]], var='gene')
    return_lists[['counts']][[i]] = tibble::as_tibble(return_lists[['counts']][[i]])

    return_lists[['images']][[i]] = rnacounts@images[[i]]@image
  }
  return(return_lists)
}

##
# read_matrices_fps: Takes a list with file paths to count and coordinate matrices
# and sample names and returns a list with count and coordinates data frames per sample
# @param filepaths a list with file paths to Visium outputs and sample IDs
# @param input_check The result of detect_input
# @return return_lists a list with two lists within (one with counts, one with coordinates)
#
read_matrices_fps = function(filepaths, input_check, cores=NULL){
  # Get delimiter of RNA counts and coordinates.
  # NOTE: This only tests the first matrix.
  # NOTE2: Part of this is commented out because, presumably, the detect_input function
  # already provides the delimiter to use
  if(!is.null(input_check$rna) & !is.null(input_check$coords)){
    delrna = input_check$rna[2]
    delcoords = input_check$coords[2]
    if(!(delrna %in% c(',', '\t')) | !(delcoords %in% c(',', '\t'))){
      stop('Counts or coordinate files are not comma- or tab-delimited') # Print error if files were not detected to be comma- or tab-delimited
    }
  } else{ # If delimiter is not available (for example because only sample file provided), find delimiter
    count_testline = readLines(filepaths[['count_found']][1], n=1)
    if(grepl('\t', count_testline)){
      delrna = '\t'
    } else if(grepl(',', count_testline)){
      delrna = ','
    } else{
      stop('RNA counts are not comma- or tab-delimited')
    }
    coord_testline = readLines(filepaths[['coord_found']][1], n=1)
    if(grepl('\t', coord_testline)){
      delcoords = '\t'
    } else if(grepl(',', coord_testline)){
      delcoords = ','
    } else{
      stop('Spot coordinates are not comma- or tab-delimited')
    }
  }

  # Use parallelization to read count data if possible.
  counts_df_list = parallel::mclapply(seq_along(filepaths[['count_found']]), function(i){
    # Read filepaths.
    #counts_df = readr::read_delim(filepaths[['count_found']][i], delim=delrna, col_types=readr::cols(), progress=F)
    counts_df = data.table::fread(filepaths[['count_found']][i], sep=delrna)
    counts_df = as.data.frame(counts_df)
    return(counts_df)
  }, mc.cores=cores, mc.preschedule=F)
  # Name list elements.
  names(counts_df_list) = filepaths[['sampleids']]

  # Use parallelization to read coordinate data if possible.
  coords_df_list = parallel::mclapply(seq_along(filepaths[['coord_found']]), function(i){
    # Read filepaths.
    coords_df = readr::read_delim(filepaths[['coord_found']][i], delim=delcoords, col_types=readr::cols(), progress=F)
    return(coords_df)
  }, mc.cores=cores, mc.preschedule=F)
  # Name list elements.
  names(coords_df_list) = filepaths[['sampleids']]

  rm(delrna, delcoords) # Clean environment

  # Process duplicate gene names
  # NOTE: Duplicate genes coming from matrices are modified with `make.unique`, but
  # are not comparable across matrices because unique identifiers for the probes are not used.
  for(i in 1:length(counts_df_list)){
    if(any(duplicated(counts_df_list[[i]][[1]]))){
      counts_df_list[[i]][[1]] = make.unique(counts_df_list[[i]][[1]])
    }
  }

  return_lists = list()
  return_lists[['counts']] = counts_df_list
  return_lists[['coords']] = coords_df_list
  return(return_lists)
}

##
# read_visium_outs: Takes a list with Visium output file paths and sample names and
# returns a list with count and coordinates data frames per sample
# @param filepaths a list with file paths to Visium outputs and sample IDs
# @param input_rnacounts The result of detect_input_rnacounts
# @return a list with two lists within (one with counts, one with coordinates)
#
read_visium_outs = function(filepaths, input_rnacounts, cores=NULL){

  # To prevent NOTES in R CMD check
  . = NULL

  # Find necessary files from visium input
  missingSamples = 0
  fp_list = list()
  for(i in 1:length(filepaths[['count_found']])){
    fp_list[[i]] = list()
    # Get all system paths within output folder (h5/MEX, coordinates, image)
    temp_fps = list.files(filepaths[['count_found']][i], recursive=T, include.dirs=T, full.names=T)

    vimage = grep('spatial\\/', temp_fps, value=T) %>%
      grep('tissue_hires_image.png|tissue_lowres_image.png', ., value=T)
    vjson = grep('spatial\\/', temp_fps, value=T) %>%
      grep('scalefactors_json.json$', ., value=T)
    vcoords = grep('spatial\\/', temp_fps, value=T) %>%
      grep('tissue_positions_list.csv|tissue_positions.csv', ., value=T)
    # Filter out 'SPATIAL_RNA_COUNTER' folders (intermediate files from Space Ranger?).
    vcoords = vcoords[!grepl('SPATIAL_RNA_COUNTER', vcoords)]

    if(input_rnacounts[1] %in% c('visium_filtered_mex', 'visium_raw_mex')){
      if(input_rnacounts[1] == 'visium_filtered_mex'){
        # NOTE: For MEX, assume files are inside a directory named 'filtered_feature_bc_matrix'
        vfeatures = grep('filtered_feature_bc_matrix\\/features.tsv.gz',  temp_fps, value=T)
        vbarcodes = grep('filtered_feature_bc_matrix\\/barcodes.tsv.gz', temp_fps, value=T)
        vcounts = grep('filtered_feature_bc_matrix\\/matrix.mtx.gz', temp_fps, value=T)
      } else if(input_rnacounts[1] == 'visium_raw_mex'){
        vfeatures = grep('raw_feature_bc_matrix\\/features.tsv.gz',  temp_fps, value=T)
        vbarcodes = grep('raw_feature_bc_matrix\\/barcodes.tsv.gz', temp_fps, value=T)
        vcounts = grep('raw_feature_bc_matrix\\/matrix.mtx.gz', temp_fps, value=T)
      }

      # Test that all files have been found.
      needed_mex_test = c(!grepl('gz', vfeatures), !grepl('gz', vbarcodes), !grepl('gz', vcounts), !grepl('csv', vcoords))
      if(any(needed_mex_test)){
        raise_err(err_code='error0035')
      }

      fp_list[[i]]$features = vfeatures
      fp_list[[i]]$barcodes = vbarcodes
      fp_list[[i]]$counts = vcounts

      if(rlang::is_empty(vfeatures)) cat(paste("Features for", filepaths$sampleids[i], "not able to be found..."))
      if(rlang::is_empty(vbarcodes)) cat(paste("Barcodes for", filepaths$sampleids[i], "not able to be found..."))
      if(rlang::is_empty(vcounts)) cat(paste("Counts for", filepaths$sampleids[i], "not able to be found..."))

      if(rlang::is_empty(vfeatures) | rlang::is_empty(vbarcodes) | rlang::is_empty(vcounts) | rlang::is_empty(vcoords)){
        fp_list[[i]] = list()
        missingSamples  = missingSamples + 1
      }

      rm(vfeatures, vbarcodes, vcounts) # Clean environment
    } else if(input_rnacounts[1] %in% c('visium_filtered_h5', 'visium_raw_h5')){
      if(input_rnacounts[1] == 'visium_filtered_h5'){
        h5counts = grep('filtered_feature_bc_matrix.h5', temp_fps, value=T)
      } else if(input_rnacounts[1] == 'visium_raw_h5'){
        h5counts = grep('raw_feature_bc_matrix.h5', temp_fps, value=T)
      }

      # Test that all files have been found.
      needed_h5_test = c(!grepl('\\.h5$', h5counts), !grepl('csv', vcoords))
      if(any(needed_h5_test)){
        raise_err(err_code='error0035')
      }

      fp_list[[i]]$counts = h5counts

      if(rlang::is_empty(vcoords)) cat(paste("Coordinates for", filepaths$sampleids[i], "not able to be found...\n"))
      if(rlang::is_empty(h5counts) | rlang::is_empty(vcoords)){
        fp_list[[i]] = list()
        missingSamples  = missingSamples + 1
      }
      rm(h5counts) # Clean environment
    }

    fp_list[[i]]$coords = vcoords
    fp_list[[i]]$image = vimage
    fp_list[[i]]$json = vjson
    fp_list[[i]]$runname = filepaths[['sampleids']][i]

    rm(temp_fps, vcoords, vimage) # Clean environment
  }

  cat(paste("\tFound", length(filepaths$sampleids)-missingSamples, "Visium samples\n"))

  # Define number of available cores to use.
  if(is.null(cores)){
    cores = count_cores(length(filepaths[['count_found']]))
  } else{
    cores = as.integer(cores)
    if(is.na(cores)){
      raise_err(err_code='error0036')
    }
  }

  # Use parallelization to read count data if possible.
  output_temp = parallel::mclapply(seq_along(filepaths[['count_found']]), function(i){
    if(length(fp_list[[i]]$counts) == 0){
      return(list())
    }
    system(sprintf('echo "%s"', paste0("\t\tProcessing Sample ", i, "....")))

    # Process Visium outputs
    if(input_rnacounts[1] %in% c('visium_filtered_mex', 'visium_raw_mex')){
      visium_processed = import_visium(features_fp=fp_list[[i]][['features']],
                                       barcodes_fp=fp_list[[i]][['barcodes']],
                                       counts_fp=fp_list[[i]][['counts']],
                                       coords_fp=fp_list[[i]][['coords']])
    } else{
      visium_processed = import_visium_h5(counts_fp=fp_list[[i]][['counts']],
                                          coords_fp=fp_list[[i]][['coords']])
    }
    system(sprintf('echo "%s"', paste0("\t\tFinished data read Sample ", i)))

    return(visium_processed)
  }, mc.cores=cores, mc.preschedule=F)
  cat(paste("\tData read completed\n"))

  # Organize the paralellized output into corresponding lists.
  return_lists = list()
  return_lists[['counts']] = list()
  return_lists[['coords']] = list()
  return_lists[['images']] = list()
  return_lists[['json_scale']] = list()
  for(i in 1:length(output_temp)){
    return_lists[['counts']][[fp_list[[i]]$runname]] = output_temp[[i]]$rawcounts
    return_lists[['coords']][[fp_list[[i]]$runname]] = output_temp[[i]]$coords
    return_lists[['images']][[fp_list[[i]]$runname]] = NULL
    return_lists[['json_scale']][[fp_list[[i]]$runname]] = NULL
    # Test if image file is present
    if(!rlang::is_empty(fp_list[[i]]$image)){
      # Try reading image
      image_read = tryCatch({png::readPNG(fp_list[[i]]$image[1])},
                            error=function(e){return(e)})
      # Try reading image again using the second element if available
      if(any(class(image_read) == 'simpleError') & length(fp_list[[i]][['image']]) > 1){
        image_read = tryCatch({png::readPNG(fp_list[[i]]$image[2])},
                              error=function(e){return(e)})
      }
      # If no image reading was possible, do not output image
      if(any(class(image_read) == 'simpleError')){
        image_read = NULL
        warning('Tissue image could not be read. Is the image a PNG derived from Space Ranger?')
      }
      return_lists[['images']][[fp_list[[i]]$runname]] = image_read
    }
    # Test if JSON scaling factor file is present
    if(!rlang::is_empty(fp_list[[i]]$json)){
      # image_read = tryCatch({jsonlite::read_json(fp_list[[i]]$json[1])},
      #                       error=function(e){png::readPNG(fp_list[[i]]$image[2])})
      json_read = jsonlite::read_json(fp_list[[i]]$json[1])
      return_lists[['json_scale']][[fp_list[[i]]$runname]] = json_read
    } else{
      return_lists[['json_scale']][[fp_list[[i]]$runname]] = 'no_scaling_available'
    }
  }

  rm(fp_list) # Clean environment
  return(return_lists)
} # CLOSE read_visium_outs

##
# read_xenium_outs: Takes a list with Xenium output file paths and sample names and
# returns a list with count and coordinates data frames per sample
# @param filepaths a list with file paths to Xenium outputs and sample IDs
# @param input_rnacounts The result of detect_input_rnacounts
# @return a list with two lists within (one with counts, one with coordinates)
#
read_xenium_outs = function(filepaths, input_rnacounts, cores=NULL){
  # Find necessary files from Xenium input
  missingSamples = 0
  fp_list = list()
  for(i in 1:length(filepaths[['count_found']])){
    fp_list[[i]] = list()
    # Get all system paths within output folder (h5/MEX, coordinates, image)
    temp_fps = list.files(filepaths[['count_found']][i], recursive=T, include.dirs=T, full.names=T)

    vcoords = grep('cells.parquet$', temp_fps, value=T)
    # Filter out 'SPATIAL_RNA_COUNTER' folders (intermediate files from Space Ranger? Also in Xenium outputs?).
    #vcoords = vcoords[!grepl('SPATIAL_RNA_COUNTER', vcoords)]

    if(input_rnacounts[1] == 'xenium_mex'){
      vfeatures = grep('cell_feature_matrix\\/features.tsv.gz',  temp_fps, value=T)
      vbarcodes = grep('cell_feature_matrix\\/barcodes.tsv.gz', temp_fps, value=T)
      vcounts = grep('cell_feature_matrix\\/matrix.mtx.gz', temp_fps, value=T)

      # Test that all files have been found.
      needed_mex_test = c(!grepl('gz', vfeatures), !grepl('gz', vbarcodes), !grepl('gz', vcounts), !grepl('\\.parquet|\\.csv', vcoords))
      if(any(needed_mex_test)){
        raise_err(err_code='error0037')
      }

      fp_list[[i]]$features = vfeatures
      fp_list[[i]]$barcodes = vbarcodes
      fp_list[[i]]$counts = vcounts

      if(rlang::is_empty(vfeatures)) cat(paste("Features for", filepaths$sampleids[i], "not able to be found...\n"))
      if(rlang::is_empty(vbarcodes)) cat(paste("Barcodes for", filepaths$sampleids[i], "not able to be found...\n"))
      if(rlang::is_empty(vcounts)) cat(paste("Counts for", filepaths$sampleids[i], "not able to be found...\n"))

      if(rlang::is_empty(vfeatures) | rlang::is_empty(vbarcodes) | rlang::is_empty(vcounts) | rlang::is_empty(vcoords)){
        fp_list[[i]] = list()
        missingSamples  = missingSamples + 1
      }

      rm(vfeatures, vbarcodes, vcounts) # Clean environment
    } else if(input_rnacounts[1] == 'xenium_h5'){
      h5counts = grep('cell_feature_matrix.h5', temp_fps, value=T)

      # Test that all files have been found.
      needed_h5_test = c(!grepl('\\.h5$', h5counts), !grepl('\\.parquet|\\.csv', vcoords))
      if(any(needed_h5_test)){
        raise_err(err_code='error0037')
      }

      fp_list[[i]]$counts = h5counts

      if(rlang::is_empty(vcoords)) cat(paste("Coordinates for", filepaths$sampleids[i], "not able to be found...\n"))
      if(rlang::is_empty(h5counts) | rlang::is_empty(vcoords)){
        fp_list[[i]] = list()
        missingSamples  = missingSamples + 1
      }
      rm(h5counts) # Clean environment
    }

    fp_list[[i]]$coords = vcoords
    fp_list[[i]]$runname = filepaths[['sampleids']][i]

    rm(temp_fps, vcoords) # Clean environment
  }

  cat(paste("\tFound", length(filepaths$sampleids)-missingSamples, "Xenium samples\n"))

  # Define number of available cores to use.
  if(is.null(cores)){
    cores = count_cores(length(filepaths[['count_found']]))
  } else{
    cores = as.integer(cores)
    if(is.na(cores)){
      raise_err(err_code='error0038')
    }
  }

  # Use parallelization to read count data if possible.
  output_temp = parallel::mclapply(seq_along(filepaths[['count_found']]), function(i){
    if(length(fp_list[[i]]$counts) == 0){
      return(list())
    }
    system(sprintf('echo "%s"', paste0("\t\tProcessing Sample ", i, "....")))

    # Process Xenium outputs
    if(input_rnacounts[1] == 'xenium_mex'){
      xenium_processed = import_xenium_mex(features_fp=fp_list[[i]][['features']],
                                       barcodes_fp=fp_list[[i]][['barcodes']],
                                       counts_fp=fp_list[[i]][['counts']],
                                       coords_fp=fp_list[[i]][['coords']])
    } else{
      xenium_processed = import_xenium_h5(counts_fp=fp_list[[i]][['counts']],
                                          coords_fp=fp_list[[i]][['coords']])
    }
    system(sprintf('echo "%s"', paste0("\t\tFinished data read Sample ", i)))

    return(xenium_processed)
  }, mc.cores=cores, mc.preschedule=F)
  cat(paste("\tData read completed\n"))

  # Organize the paralellized output into corresponding lists.
  return_lists = list()
  return_lists[['counts']] = list()
  return_lists[['coords']] = list()
  for(i in 1:length(output_temp)){
    return_lists[['counts']][[fp_list[[i]]$runname]] = output_temp[[i]]$rawcounts
    return_lists[['coords']][[fp_list[[i]]$runname]] = output_temp[[i]]$coords
  }

  rm(fp_list) # Clean environment
  return(return_lists)
} # CLOSE read_xenium_outs

##
# read_cosmx_input: Takes a list with CosMx-SMI output file paths and sample names and
# returns a list with count and coordinates data frames per sample
# @param filepaths a list with file paths to CosMx-SMI outputs and sample IDs
# @param input_check The result of detect_input
# @return return_lists a list with two lists within (one with counts, one with coordinates)
#
read_cosmx_input = function(filepaths, input_check, cores=NULL){
  # Verify that the files contain the necessary columns: fov, cell_ID, CenterX_local_px, and CenterY_local_px
  missingSamples = 0
  fp_list = list()
  for(i in 1:length(filepaths[['count_found']])){
    # Does RNA count file have the required columns?
    firstline = readLines(filepaths[['count_found']][[i]], n=1)
    rna_cols = (grepl('fov', firstline) & grepl('cell_ID', firstline))
    rm(firstline)

    # Does coordinate file have the required columns?
    firstline = readLines(filepaths[['coord_found']][[i]], n=1)
    coord_cols = (grepl('fov', firstline) & grepl('cell_ID', firstline) & grepl('CenterX_local_px', firstline) & grepl('CenterY_local_px', firstline))
    rm(firstline)

    # If columns are not present in files, count as missing
    fp_list[[i]] = list()
    if(!rna_cols | !coord_cols){
      missingSamples  = missingSamples + 1
    } else{
      fp_list[[i]][['counts']] = filepaths[['count_found']][[i]]
      fp_list[[i]][['coords']] = filepaths[['coord_found']][[i]]
      fp_list[[i]][['runname']] = filepaths[['sampleids']][[i]]
    }
  }

  cat(paste("Found", length(filepaths$sampleids)-missingSamples, "CosMx-SMI samples\n"))

  # Use parallelization to read count data if possible.
  output_temp = parallel::mclapply(seq_along(fp_list), function(i){
    if(length(fp_list[[i]][['counts']]) == 0){
      return(list())
    }
    system(sprintf('echo "%s"', paste0("\tProcessing sample ", i, "....")))

    # Process CosMx outputs.
    cosmx_processed = import_smi(counts_fp=fp_list[[i]][['counts']],
                                 coords_fp=fp_list[[i]][['coords']],
                                 slidename=fp_list[[i]][['runname']])

    system(sprintf('echo "%s"', paste0("\tFinished data read sample ", i)))
    return(cosmx_processed)
  }, mc.cores=cores, mc.preschedule=F)
  cat("\tData read completed\n")

  # Organize the paralellized output into corresponding lists.
  return_lists = list()
  return_lists[['counts']] = list()
  return_lists[['coords']] = list()
  for(i in 1:length(output_temp)){
    return_lists[['counts']] = append(return_lists[['counts']], output_temp[[i]][['rawcounts']])
    return_lists[['coords']] = append(return_lists[['coords']], output_temp[[i]][['coords']])
  }

  rm(fp_list) # Clean environment
  return(return_lists)
}


##
# process_lists: Takes two named lists of counts and coordinates and processes gene and
# spot names before placing them within an STlist
# @param counts_df_list, a named list with counts
# @param coords_df_list, a named list with coordinates
# @return return_lists a list with two lists within (one with counts, one with coordinates)
#
process_lists = function(counts_df_list, coords_df_list){
  # Keep track of FOVs with zero counts
  rm_fov = c()
  # Process the count and coordinate list.
  for(i in 1:length(names(counts_df_list))){

    # Get loop current name to be processed.
    name_i = names(counts_df_list)[i]

    # One of the test data sets has column names beginning with a number. To prevent this from
    # becoming a problem for data sets with the same issue, add an 'x' prefix, similar to what R would do
    if(any(grepl("^[0-9]", colnames(counts_df_list[[name_i]])))){
      colnames(counts_df_list[[name_i]])[-1] = paste0('x', colnames(counts_df_list[[name_i]][, -1]))
      # Make the same addition in coordinate data to match column names in count data
      coords_df_list[[name_i]][, 1] = paste0('x', coords_df_list[[name_i]][[1]])
    }

    # Test that spot names are the same in both count and coordinate data frames.
    if(length(setdiff(colnames(counts_df_list[[name_i]])[-1], unlist(coords_df_list[[name_i]][, 1]))) != 0){
      raise_err(err_code='error0002', samplename=name_i)
    }

    array_col = names(coords_df_list[[name_i]])[3]
    # Sort coordinate data according to third column in the coordinate data frame.
    coords_df_list[[name_i]] = coords_df_list[[name_i]] %>% dplyr::arrange(!!array_col)

    # Convert to sparse matrix if not one
    if(!any(class(counts_df_list[[name_i]]) == 'dgCMatrix')){
      counts_df_list[[name_i]] = as.data.frame(counts_df_list[[name_i]])
      counts_df_list[[name_i]] = makeSparse(counts_df_list[[name_i]])
    }

    # Order column names in count data frame according to sorted coordinate data.
    if(!all(as.vector(coords_df_list[[name_i]][[1]]) %in% colnames(counts_df_list[[name_i]]))){
      warning('Not all spots/cells in coordinates are present in count data. Removing spots/cells not in counts data') # WARNING IF COUNTS ARE SUBSET OF COORDINATES
      coords_df_list[[name_i]] = coords_df_list[[name_i]][as.vector(coords_df_list[[name_i]][[1]]) %in% colnames(counts_df_list[[name_i]]), ]
    }
    counts_df_list[[name_i]] = counts_df_list[[name_i]][, as.vector(coords_df_list[[name_i]][[1]])]

    # Put column names to coordinate data (if not already there)
    if(sum(grepl('libname|ypos|xpos', colnames(coords_df_list[[name_i]]))) != 3){
      colnames(coords_df_list[[name_i]]) = c('libname', 'ypos', 'xpos')
    }
    # Force numeric to 2 and 3 column of coordinates table to ensure coordinates can be used
    coords_df_list[[name_i]][[2]] = as.numeric(coords_df_list[[name_i]][[2]])
    coords_df_list[[name_i]][[3]] = as.numeric(coords_df_list[[name_i]][[3]])

    # Get total gene counts and genes with non-zero counts
#    if(class(counts_df_list[[name_i]])[1] == 'dgCMatrix'){
      coords_df_list[[name_i]][['total_counts']] = as.vector(colSums(as.matrix(counts_df_list[[name_i]])))
      coords_df_list[[name_i]][['total_genes']] = as.vector(colSums(as.matrix(counts_df_list[[name_i]]) != 0))
#    } else{
#      coords_df_list[[name_i]][['total_counts']] = colSums(counts_df_list[[name_i]][, -1])
#      coords_df_list[[name_i]][['total_genes']] = colSums(counts_df_list[[name_i]][, -1] != 0)
#    }

    # If no counts in the entire FOV, mark for removal
    if(sum(coords_df_list[[name_i]][['total_counts']]) < 1){
      rm_fov = append(rm_fov, name_i)
      warning(paste0('No counts present in FOV ', name_i, '. Removing from data set.\n'))
    }
  }

  # Remove FOVs without counts
  if(length(rm_fov) > 0){
    counts_df_list = counts_df_list[ !(names(counts_df_list) %in% rm_fov) ]
    coords_df_list = coords_df_list[ !(names(coords_df_list) %in% rm_fov) ]
  }

  proc_return_lists = list()
  proc_return_lists[['counts']] = counts_df_list
  proc_return_lists[['coords']] = coords_df_list
  return(proc_return_lists)
}

##
# process_sample_filepaths: Gets file paths to Visium outputs or count/coord matrices
# from a sample metadata file
# @param samples, the file path to metadata file
# @param input_check, a list resulting from detect_input
# @return filepaths, a list with found file paths per each sample, and sample IDs
#
process_sample_filepaths = function(samples, input_check){
  # Check that all system paths provided exist.
  if(!file.exists(samples)){
    stop('Could not find samples file.')
  }
  # Read sample file and get sample names.
  delsample = input_check$samples[2]
  sample_file_df = readr::read_delim(samples, delim=delsample, col_types=readr::cols(), progress=F, show_col_types=F)
  sample_names = sample_file_df[[1]]
  # Create objects to store sample names found. Will be used to assess if expected samples exist.
  # Also store file paths to open
  filepaths = list()
  if(input_check$samples[1] %in% c('samplesfile_visium_h5', 'samplesfile_visium_mex')){
    for(i in sample_names){
      sample_i = as.vector(unlist(sample_file_df[, 2][grep(i, sample_file_df[[1]]), ]))
      filepaths[['count_found']] = append(filepaths[['count_found']], sample_i[1])
    }
    filepaths[['sampleids']] = sample_names
  } else{
    for(i in sample_names){
      sample_i = as.vector(unlist(sample_file_df[, c(2, 3)][grep(i, sample_file_df[[1]]), ]))
      filepaths[['count_found']] = append(filepaths[['count_found']], sample_i[1])
      filepaths[['coord_found']] = append(filepaths[['coord_found']], sample_i[2])
    }
    filepaths[['sampleids']] = sample_names
  }
  return(filepaths)
}

##
# process_sample_names: Match file paths to Visium outputs or count/coord matrices
# to sample names for processing
# @param rnacounts, vector of file paths to Visium outputs or count matrices
# @param spotcoords, vector of file paths to coordinate matrices
# @param samples, vector of sample names
# @param input_rnacounts, a string resulting from detect_input_rnacounts
# @param input_spotcoords, a string resulting from detect_input_spotcoords
# @return a list with found file paths per each sample, and sample IDs
#
process_sample_names = function(rnacounts, spotcoords, samples, input_rnacounts, input_spotcoords){
  # Create objects to store sample names found. Will be used to assess if expected samples exist.
  # Also store paths to files
  filepaths = list()
  vis_xen = c('visium_filtered_h5', 'visium_raw_h5', 'visium_filtered_mex', 'visium_raw_mex', 'xenium_h5', 'xenium_mex')
  delim = c('tab_delim', 'comma_delim', 'tab_delim_cosmx', 'comma_delim_cosmx')
  if(input_rnacounts %in% vis_xen){
    for(i in samples){
      sample_i = grep(i, rnacounts, value=T, fixed=T)
      if(length(sample_i) == 1){
        filepaths[['count_found']] = append(filepaths[['count_found']], sample_i)
        filepaths[['sampleids']] = append(filepaths[['sampleids']], i)
      } else if(length(sample_i) > 1){
        raise_err(err_code='error0001')
      } else{
        warning(paste0('Sample ', i, ' was not found among the Visium or Xenium outputs.'))
      }
    }
  } else if(input_rnacounts %in% delim && input_spotcoords %in% delim){
    for(i in samples){
      sample_count = grep(i, rnacounts, value=T, fixed=T)
      sample_coord = grep(i, spotcoords, value=T, fixed=T)
      if(length(sample_count) != 0 & length(sample_coord) != 0){
        filepaths[['count_found']] = append(filepaths[['count_found']], sample_count)
        filepaths[['coord_found']] = append(filepaths[['coord_found']], sample_coord)
        filepaths[['sampleids']] = append(filepaths[['sampleids']], i)
      } else{
        warning(paste0('Sample ', i, ' was not found among the count/coordinate files.'))
      }
    }
  } else{
    raise_err(err_code='error0034')
  }
  return(filepaths)
}

##
# process_sample_names_from_file: Match file paths to Visium outputs or count/coord matrices
# to sample names from a meta data file for later processing
# @param rnacounts, vector of file paths to Visium outputs or count matrices
# @param spotcoords, vector of file paths to coordinate matrices
# @param samples, the file path to metadata file
# @param input_check, a list resulting from detect_input
# @return filepaths, a list with found file paths per each sample, and sample IDs
#
process_sample_names_from_file = function(rnacounts, spotcoords, samples, input_check){
  # Check that all system paths provided exist.
  if(!file.exists(samples)){
    stop('Could not find samples file.')
  }
  # Read sample file and get sample names.
  delsample = input_check$samples[2]
  sample_file_df = readr::read_delim(samples, delim=delsample, col_types=readr::cols(), progress=F, show_col_types=F, name_repair='minimal')
  sample_names = sample_file_df[[1]]
  # Create objects to store sample names found. Will be used to assess if expected samples exist.
  # Also store file paths to open
  filepaths = list()
  if(!is.null(input_check$rna)){
    if(input_check$rna[1] %in% c('visium_out_h5', 'visium_out_mex')){
      for(i in sample_names){
        sample_i = grep(i, rnacounts, value=T)
        if(length(sample_i) != 0){
          filepaths[['count_found']] = append(filepaths[['count_found']], sample_i)
          filepaths[['sampleids']] = append(filepaths[['sampleids']], i)
        }
      }
    }
  } else if(input_check$samples[1] %in% c('samplesfile_visium_h5', 'samplesfile_visium_mex')){
    for(i in sample_names){
      visium_fps = sample_file_df[[2]] # If only samples file provided, then visium directories must be in second column of file
      sample_i = grep(i, visium_fps, value=T)
      if(length(sample_i) != 0){
        filepaths[['count_found']] = append(filepaths[['count_found']], sample_i)
        filepaths[['sampleids']] = append(filepaths[['sampleids']], i)
      }
    }
  } else{
    for(i in sample_names){
      # Get file paths to counts and coords from samples file
      rna_fps = sample_file_df[[2]]
      coords_fp = sample_file_df[[3]]
      sample_count = grep(i, rna_fps, value=T)
      sample_coord = grep(i, coords_fp, value=T)
      if(length(sample_count) != 0 & length(sample_coord) != 0){
        filepaths[['count_found']] = append(filepaths[['count_found']], sample_count)
        filepaths[['coord_found']] = append(filepaths[['coord_found']], sample_coord)
        filepaths[['sampleids']] = append(filepaths[['sampleids']], i)
      }
    }
  }
  return(filepaths)
}

##
# process_meta: Takes the input file path of a metadata table and return the processed
# data frame for the STlist
# @param samples, the file path to metadata file
# @param input_check, a list resulting from detect_input
# @param counts_df_list, a list with counts resulting from process_lists
# @return samples_df, a data frame to be placed within the clinical slot of the STlist
#
process_meta = function(samples, input_check, counts_df_list){

  # To prevent NOTES in R CMD check
  . = NULL

  # Get delimiter of file from input_check
  del = input_check[['samples']][2]
  # Read file.
  samples_df = readr::read_delim(samples, delim=del, col_types=readr::cols(), progress=F, show_col_types=F)
  # Remove leading spaces in column names if any
  colnames(samples_df) = gsub('^ ', '', colnames(samples_df))

  # Get sorted list names and fetch corresponding rows from the samplefile.
  sorted_names = sort(names(counts_df_list))

  # Check if all sample names in metadata table are present
  # If CosMx, expand metadata to FOV level
  if(!is.null(input_check[['rna']])){
    if(input_check[['rna']][[1]] == 'cosmx'){
      samples_df_fov = tibble::tibble(fovid=sorted_names)
      samples_df_fov[[colnames(samples_df)[[1]]]] = stringr::str_extract(sorted_names, paste0('^', samples_df[[1]], collapse='|'))
      samples_df = samples_df %>% dplyr::left_join(samples_df_fov, ., by=colnames(samples_df)[[1]])
      colnames(samples_df)[c(1,2)] = c('sample_name', 'slide_name')
      rm(samples_df_fov) # Clean env
    } else{
      samples_df = samples_df[samples_df[[1]] %in% sorted_names, ]
    }
  } else{
    samples_df = samples_df[samples_df[[1]] %in% sorted_names, ]
  }

  # Remove file paths from sample file if existent
  if(input_check$samples[1] %in% c('samplesfile_visium_h5', 'samplesfile_visium_mex')){
    samples_df = samples_df[, -2]
  } else if(input_check$samples[1] == 'samplesfile_matrices'){
    samples_df = samples_df[, -c(2:3)]
  }
  # Test that number of rows in sample file correspond to number of elements in list.
  if(nrow(samples_df) != length(counts_df_list)){
    stop("The number of rows in the sample data is not the same as the number of spatial arrays.")
  }
  return(samples_df)
}

# process_meta_geomx = function(samples, input_check, counts_df_list, gmx_slide_col, gmx_meta_cols){
#   if(input_check$samples[2] == 'xls'){
#     samples_df = readxl::read_excel(samples)
#   } else {
#     # Get delimiter of file from input_check
#     del = input_check$samples[2]
#     samples_df = readr::read_delim(samples, delim=del, col_types=readr::cols(), progress=F, show_col_types=F)
#   }
#   samples_df = samples_df[samples_df[[gmx_slide_col]] %in% names(counts_df_list), ]
#   samples_df = samples_df[, c(gmx_slide_col, gmx_meta_cols)]
#
#   samples_df_summ = tibble::tibble()
#   for(slide_id in unique(samples_df[[gmx_slide_col]])){
#     meta_row = c(sampleID=slide_id)
#     for(metacol in gmx_meta_cols){
#       if(is.numeric(samples_df[[metacol]])){
#         tmp_val = mean(samples_df[[metacol]][samples_df[[gmx_slide_col]] == slide_id], na.rm=T)
#       } else{
#         tmp_val = unique(samples_df[[metacol]][samples_df[[gmx_slide_col]] == slide_id])
#         tmp_val = paste(tmp_val, collapse='|')
#       }
#       names(tmp_val) = metacol
#       meta_row = append(meta_row, tmp_val)
#     }
#     samples_df_summ = dplyr::bind_rows(samples_df_summ, meta_row)
#   }
#   return(samples_df_summ)
# }