formatTxSpots: Read and process transcript spots geometry for SFE

View source: R/formatTxSpots.R

formatTxSpots R Documentation

Description

The function 'formatTxSpots' reads the transcript spot coordinates of smFISH-based data and formats them, without adding them to an SFE object. If the file specified in 'file_out' already exists, it is read instead of the original file in the 'file' argument, so the processing is not repeated. The function 'addTxSpots' adds the data read and processed by 'formatTxSpots' to the SFE object, and reads all transcript spot data. To read only a subset of the transcript spot data, first use 'formatTxSpots' to write the reformatted data to disk, then read the specific subset and add it to the SFE object with the setter functions.

Usage

formatTxSpots(
  file,
  dest = c("rowGeometry", "colGeometry"),
  spatialCoordsNames = c("global_x", "global_y", "global_z"),
  gene_col = "gene",
  cell_col = "cell_id",
  z = "all",
  phred_col = "qv",
  min_phred = 20,
  split_col = NULL,
  not_in_cell_id = c("-1", "UNASSIGNED"),
  z_option = c("3d", "split"),
  flip = FALSE,
  file_out = NULL,
  BPPARAM = SerialParam(),
  return = TRUE
)

addTxSpots(
  sfe,
  file,
  sample_id = 1L,
  spatialCoordsNames = c("global_x", "global_y", "global_z"),
  gene_col = "gene",
  z = "all",
  phred_col = "qv",
  min_phred = 20,
  split_col = NULL,
  z_option = c("3d", "split"),
  flip = FALSE,
  file_out = NULL,
  BPPARAM = SerialParam()
)

Arguments

file

File with the transcript spot coordinates. Should be one row per spot when read into R and should have columns for coordinates on each axis, gene the transcript is assigned to, and optionally cell the transcript is assigned to. Must be csv, tsv, or parquet.

dest

Where in the SFE object to store the spot geometries. This affects how the data is processed. Options:

rowGeometry

All spots for each gene will be a 'MULTIPOINT' geometry, regardless of whether they are in cells or which cells they are assigned to.

colGeometry

The spots for each gene assigned to a cell of interest will be a 'MULTIPOINT' geometry. Because this is essentially the gene count matrix, which is sparse, the geometries are written to disk and NOT returned in memory.

spatialCoordsNames

Column names for the x, y, and optionally z coordinates of the spots. The defaults are for Vizgen.

gene_col

Column name for genes.

cell_col

Column name for cell IDs, ignored if 'dest = "rowGeometry"'. Can have length > 1 when multiple columns are needed to uniquely identify cells, in which case the contents of the columns will be concatenated, such as in CosMX data where cell ID is only unique within the same FOV. Default "cell_id" is for Vizgen MERFISH. Should be 'c("cell_ID", "fov")' for CosMX.
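
To illustrate why more than one column may be needed, the base R sketch below builds a combined cell ID from a toy CosMX-like table in which 'cell_ID' repeats across FOVs. The toy data frame, the 'cell_key' column, and the "_" separator are assumptions made for illustration; the separator that 'formatTxSpots' uses internally is not stated here.

```r
# Toy table: cell_ID repeats across FOVs, so it is not unique on its own
spots <- data.frame(
  cell_ID = c(1, 2, 1, 2),
  fov     = c(1, 1, 2, 2)
)

# Columns that together identify a cell, as in the cell_col argument
cell_col <- c("cell_ID", "fov")

# Concatenate the ID columns row by row (separator "_" is an assumption)
spots$cell_key <- do.call(paste, c(spots[cell_col], sep = "_"))

# Count distinct cells with one column versus the combined key
n_unique_single   <- length(unique(spots$cell_ID))
n_unique_combined <- length(unique(spots$cell_key))
```

With a single column the four spots appear to belong to two cells, while the combined key separates them into four.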

z

Index of z plane to read. Can be "all" to read all z-planes into MULTIPOINT geometries with XYZ coordinates. If z values are not integer, then spots with all z values will be read.

phred_col

Column name for Phred scores of the spots.

min_phred

Minimum Phred score to keep a spot. By default 20, the conventional threshold indicating "acceptable", meaning that there's a 1% chance that the spot was decoded in error.
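
As a quick check on what the threshold means, the standard Phred definition gives the error probability as 10^(-Q/10). The sketch below applies it in base R to a toy vector of scores; 'phred_to_error' and the 'qv' vector are hypothetical names used only for illustration and are not part of the package.

```r
# Standard Phred definition: error probability = 10^(-Q / 10)
phred_to_error <- function(q) 10^(-q / 10)

# Toy quality scores for five spots
qv <- c(5, 12, 20, 30, 40)
min_phred <- 20

# Error probability at the default threshold
p_err <- phred_to_error(min_phred)

# Spots at or above the threshold would be kept
keep <- qv >= min_phred
n_kept <- sum(keep)
```

Whether the package compares with '>=' or '>' at the threshold is an assumption here.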

split_col

Categorical column to split the geometries, such as cell compartment the spots are assigned to as in the "CellComp" column in CosMX output.

not_in_cell_id

Value of cell ID indicating that the spot is not assigned to any cell, such as "-1" in Vizgen MERFISH and "0" in CosMX. When there are multiple columns for 'cell_col', the first column is used to identify spots that are not in cells.

z_option

What to do with z coordinates. "3d" is to construct 3D geometries. "split" is to create a separate 2D geometry for each z-plane so geometric operations are fully supported but some data wrangling is required to perform 3D analyses. When the z coordinates are not integers, 3D geometries will always be constructed since there are no z-planes to speak of. This argument does not apply when 'spatialCoordsNames' has length 2.
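
The idea behind the "split" option can be sketched in base R: when the z values are whole numbers, the spots can be partitioned into one 2D table per z-plane. This is only a conceptual illustration on a toy data frame, not the package's implementation, which builds sf geometries.

```r
# Toy spots with integer z-planes
spots <- data.frame(
  gene = c("A", "A", "A", "B"),
  x    = c(1, 2, 3, 4),
  y    = c(5, 6, 7, 8),
  z    = c(0, 0, 1, 1)
)

# z-planes only make sense when every z value is a whole number
is_planar <- all(spots$z == round(spots$z))

# One 2D table (no z column) per plane
by_plane <- split(spots[, c("gene", "x", "y")], spots$z)
plane_names <- names(by_plane)
```

Each element of 'by_plane' could then be turned into its own 2D geometry, at the cost of extra wrangling for analyses that span planes.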

flip

Logical, whether to flip the geometry to match image. Here the y coordinates are simply set to -y, so the original bounding box is not preserved. This is consistent with readVizgen and readXenium.
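
A small base R sketch of what negating y does to the bounding box, using made-up coordinates: the extent along y keeps its size but moves to the negative side of the axis.

```r
# Toy y coordinates of three spots
y <- c(10, 20, 35)

# Flipping as described: y is simply negated
y_flipped <- -y

# The y range moves from positive to negative values
range_before <- range(y)
range_after  <- range(y_flipped)

# The height of the bounding box is unchanged
height_before <- diff(range_before)
height_after  <- diff(range_after)
```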

file_out

Name of file to save the geometry or raster to disk, which is especially useful when the geometries are so large that it's unwieldy to load everything into memory. If this file (or directory for multiple files) already exists, then the existing file(s) will be read, skipping the processing. When writing the file, extensions supplied are ignored and extensions are determined based on 'dest'.
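
The "read if it already exists, otherwise process and write" behavior is a common caching pattern. Below is a minimal base R sketch of that pattern, using an RDS file as a stand-in for the files the package writes; 'read_or_process' and 'process' are hypothetical helpers, not package functions.

```r
# Read the cached result if present; otherwise compute it and write it out
read_or_process <- function(file_out, process) {
  if (file.exists(file_out)) {
    return(readRDS(file_out))
  }
  res <- process()
  saveRDS(res, file_out)
  res
}

# Counter to show how many times the expensive step actually runs
n_runs <- 0
process <- function() {
  n_runs <<- n_runs + 1
  data.frame(gene = "A", n_spots = 3)
}

out <- tempfile(fileext = ".rds")
first  <- read_or_process(out, process)  # runs process() and writes the file
second <- read_or_process(out, process)  # reads the existing file instead
unlink(out)
```

One consequence of this pattern is that changing other arguments has no effect while the old output file still exists, so the file would need to be deleted to force reprocessing.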

BPPARAM

BiocParallelParam object to specify multithreading to convert raw char in some parquet files to R objects. Not used otherwise.

return

Logical, whether to return the geometries in memory. This does not depend on whether the geometries are written to file. Always 'FALSE' when 'dest = "colGeometry"'.

sfe

A 'SpatialFeatureExperiment' object.

sample_id

Which sample in the SFE object the transcript spots should be added to.

Value

An sf data frame for vector geometries if 'file_out' is not set. 'SpatRaster' for raster. If there are multiple files written, such as when splitting by cell compartment or when 'dest = "colGeometry"', then a directory with the same name as 'file_out' will be created (but without the extension) and the files are written to that directory with informative names. 'parquet' files that can be read with 'st_read' are written for vector geometries. When 'return = FALSE', the file name or directory (when there are multiple files) is returned.

The 'sf' data frame, or path to file where geometries are written if 'return = FALSE'.

Note

When 'dest = "colGeometry"', the geometries are always written to disk and not returned in memory, because this is essentially the gene count matrix, which is sparse. This kind of reformatting is implemented so users can read in MULTIPOINT geometries with transcript spots for each gene assigned to each cell for spatial point process analyses, where not all genes are loaded at once.

Examples

# Default arguments are for MERFISH
fp <- tempfile()
dir_use <- SFEData::VizgenOutput(file_path = fp)
g <- formatTxSpots(file.path(dir_use, "detected_transcripts.csv"))
unlink(dir_use, recursive = TRUE)

# For CosMX, note the colnames, also dest = "colGeometry"
# Results are written to the tx_spots directory
dir_use <- SFEData::CosMXOutput(file_path = fp)
cg <- formatTxSpots(file.path(dir_use, "Run5642_S3_Quarter_tx_file.csv"),
                    dest = "colGeometry", z = "all",
                    cell_col = c("cell_ID", "fov"),
                    gene_col = "target", not_in_cell_id = "0",
                    spatialCoordsNames = c("x_global_px", "y_global_px", "z"),
                    file_out = file.path(dir_use, "tx_spots"))
# Cleanup
unlink(dir_use, recursive = TRUE)

pachterlab/SpatialFeatureExperiment documentation built on Nov. 15, 2024, 1:46 a.m.