rehashSE: anonymize (to some degree) a SummarizedExperiment (or an...


View source: R/rehashSE.R

Description

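Anonymizes a SummarizedExperiment by replacing its dimnames with salted one-way hashes. This is NOT cryptographically secure, nor is it equivalent to proper two-key de-identification. It is likely to dissuade casual attacks, but it cannot stop a motivated attacker.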

Usage

rehashSE(x, salt = "0x", strip = TRUE, algo = "md5",
  deorder = FALSE)

Arguments

x

a RangedSummarizedExperiment to anonymize

salt

a salting phrase to slow brute-force attacks (default is "0x")

strip

strip rehashed objects of any de-identifying metadata? (default is TRUE)

algo

algorithm to use for the one-way hash (default is "md5")

deorder

scramble rows and columns? (default is FALSE; scrambling disrupts the data digest)
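As a rough illustration of how salt and algo interact, the sketch below hashes a vector of names with a salt prepended. It assumes the digest package; the source does not say which hashing library rehashSE uses internally, and hash_names is a hypothetical helper, not part of the rehash package.

```r
library(digest)

# Hypothetical helper (not part of rehash): hash each name together with
# a salt, so the same name hashed under a different salt gives a different value.
hash_names <- function(ids, salt = "0x", algo = "md5") {
  vapply(ids, function(id) {
    # serialize = FALSE hashes the string itself rather than its R serialization
    digest(paste0(salt, id), algo = algo, serialize = FALSE)
  }, character(1), USE.NAMES = FALSE)
}

ids <- c("sampleA", "sampleB")
hashed <- hash_names(ids, salt = "testing")
```

The hash is deterministic for a given salt and algorithm, which is why anyone who knows (or guesses) the salt and the space of plausible names can reverse it by brute force.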

Details

Specialized functions for rehash'ing specialized SE-like objects and for providing key-exchangeable versions of this functionality are forthcoming.

Assay renaming currently works by matching the assay name to the actual HDF5 path name used in HDF5 backing files (assays.h5), as produced by HDF5Array::saveHDF5SummarizedExperiment(...). This should ease interop with, e.g., Python consumers of the data. They will still need reverse mappings for the column and row names, but those are not difficult to provide.

At some point, it may make more sense to save metadata for rehash/dehash purposes to a relatively language-agnostic data format like Feather, or else break up all the pieces into CSVs and write a Python package to handle the reversing of hash-mappings. Either should be fine for interop.
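A minimal sketch of the CSV option, using base R only. The column names (original, hashed), the file name, and the placeholder hash values are assumptions made for illustration; the package does not currently define such a format.

```r
# Hypothetical mapping table: original names alongside their hashed forms.
mapping <- data.frame(
  original = c("sampleA", "sampleB", "sampleC"),
  hashed   = c("h1", "h2", "h3"),  # placeholder values, not real digests
  stringsAsFactors = FALSE
)

# Write the mapping as plain CSV so non-R consumers can read it too.
csv_path <- file.path(tempdir(), "colnames_map.csv")
write.csv(mapping, csv_path, row.names = FALSE)

# Reverse the mapping: look up original names by their hashed names.
restored <- read.csv(csv_path, stringsAsFactors = FALSE)
lookup <- setNames(restored$original, restored$hashed)
hashed_cols <- c("h3", "h1")
recovered <- unname(lookup[hashed_cols])
```

Since the file is an ordinary two-column CSV, a Python consumer could build the same lookup with any CSV reader.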

Value

an object of the same class as x, with hashed dimnames

Examples

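library(SummarizedExperiment) # also attaches GenomicRanges, IRanges, S4Vectors

ncols <- 6
nrows <- 200
counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows)
# two-letter row names ("aa", "ba", ...), one per row of counts
rownames(counts) <- apply(expand.grid(letters, letters), 1,
                          paste0, collapse="")[seq_len(nrows)]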
rowRanges <- GRanges(rep(c("chr1", "chr2"), c(50, 150)),
                     IRanges(floor(runif(200, 1e5, 1e6)), width=100),
                     strand=sample(c("+", "-"), 200, TRUE),
                     feature_id=sprintf("ID%03d", 1:200))
names(rowRanges) <- rownames(counts) 
colData <- DataFrame(Treatment=rep(c("ChIP", "Input"), 3),
                     row.names=LETTERS[1:6])

# a toy RangedSummarizedExperiment (?SummarizedExperiment) 
rse <- SummarizedExperiment(assays=SimpleList(counts=counts),
                            rowRanges=rowRanges, colData=colData)
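# normalizers was undefined in the original example; per-sample library
# sizes (column sums of counts) are assumed here
normalizers <- colSums(assays(rse)$counts)
assays(rse)$cpm <- sweep(assays(rse)$counts * 1e6, 2, normalizers, `/`)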
covs <- colData(rse) # alternative to pulling these from res$covs

# rehash the toy RangedSummarizedExperiment:
res <- rehash(rse, salt="testing", strip=TRUE, algo="md5", deorder=FALSE)
deIDed <- res$object

# test it out with HDF5-backed storage:
library(HDF5Array)
deIDedPath <- file.path(tempdir(), "deIDed")
deIDed <- saveHDF5SummarizedExperiment(deIDed, deIDedPath, replace=TRUE)

# recover the original names from the rehashed object using the saved metadata:
meta <- res$meta
covs <- res$covs
reIDed <- dehash(deIDed, meta=meta, covs=covs, check=TRUE)

if (!is.null(colnames(rse))) {
  stopifnot(identical(colnames(reIDed), colnames(rse)))
} 

if (!is.null(rownames(rse))) {
  stopifnot(identical(rownames(reIDed), rownames(rse)))
} 

# seeing is believing
show(reIDed)

trichelab/rehash documentation built on Nov. 5, 2019, 10:58 a.m.