pseudoDB | R Documentation |
Pseudo-database class
con
Database connection
project
Project block from config
config
Config block from config
files
File names in the configuration file
datalog
Dataframe with file statistics. Filled before and during processing.
new()
Create a new object of class 'pseudoDB'. On initialization, the constructor reads the configuration file, checks the output directories specified there and creates any that are missing, opens a connection to the SQLite database ('shinto_pseudomaker.sqlite'), and checks whether all files listed in the config exist (missing files are skipped later).
pseudoDB$new( config_file, secret, log_to = c("file", "stdout"), max_n_lines = NULL )
config_file
Path to the YML file with settings
secret
Secret key used for (extra) symmetric encryption
log_to
Log to a file or to stdout. This applies only to the old-style logging in .log files; see also shintopseudo.csv in the file output folder(s).
max_n_lines
Max number of lines to read from the input files; used for testing only
create_directories()
Create the output, log and SQLite directories if they do not exist
pseudoDB$create_directories()
open_logfile()
Opens a log file in the log output directory
pseudoDB$open_logfile()
write_datalog()
Writes shintopseudo.csv in the file output directory
pseudoDB$write_datalog()
log()
Logs to the old-style logging file
pseudoDB$log(msg, how = c("info", "fatal", "warn"))
msg
Logging message
how
Either info, fatal or warn
set_data_log()
Update a field in the datalog during processing
pseudoDB$set_data_log(file, what, value)
file
For which file to set the datalog
what
Set which field (column)
value
Set the value
set_status()
Set the status in the datalog (for e.g. errors)
pseudoDB$set_status(file, status)
file
Filename to set a status
status
Status to set
set_error()
Set an error in the data log for a file (and a timestamp)
pseudoDB$set_error(file, error)
file
Filename to flag an error
error
Error code
read_config()
Reads the config from a .yml/.yaml file
pseudoDB$read_config(fn)
fn
Path to yml
check_files_exist()
Check if all files mentioned in the config exist
pseudoDB$check_files_exist()
open_sqlite()
Opens a connection to the SQLite database with 'DBI::dbConnect(RSQLite::SQLite(), ...)' and prepares an empty 'datadienst' table in the database if it does not already exist.
pseudoDB$open_sqlite()
vacuum_sqlite()
Performs a VACUUM on the SQLite database. This is done automatically before the connection is closed.
pseudoDB$vacuum_sqlite()
From sqlite.org: "The VACUUM command rebuilds the database file, repacking it into a minimal amount of disk space [...] Frequent inserts, updates, and deletes can cause the database file to become fragmented - where data for a single table or index is scattered around the database file. Running VACUUM ensures that each table and index is largely stored contiguously within the database file."
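The effect of VACUUM can be seen with a minimal sketch using DBI and RSQLite directly. The temporary database path and the 'demo' table are made up for illustration; the class itself takes the database location from the config.

```r
library(DBI)

# Hypothetical temporary database file (not the package's real database)
db_path <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), db_path)

# Fill a table and then empty it, which leaves unused pages in the file
dbWriteTable(con, "demo", data.frame(id = 1:50000, txt = strrep("x", 50)))
dbExecute(con, "DELETE FROM demo")
size_before <- file.size(db_path)

# Rebuild the database file so the unused pages are released
dbExecute(con, "VACUUM")
dbDisconnect(con)
size_after <- file.size(db_path)
```

Comparing size_before and size_after shows how much space the rebuild recovered.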
close_sqlite()
Close the DB connection, optionally performing a vacuum first
pseudoDB$close_sqlite(vacuum = TRUE)
vacuum
Whether to vacuum the SQLite database before closing. See the $vacuum_sqlite method.
close()
Close everything (also the log file)
pseudoDB$close()
read_data()
Reads a file from the config. Includes multiple methods.
pseudoDB$read_data(fn)
fn
Bare filename to read (full path is read from config).
Normally $read_data_fread is used. If readmethod='json', the file is read with 'jsonlite::fromJSON' and the config setting 'post_read_function' is applied to the result, so that a nested JSON can be flattened into a table suitable for CSV output.
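As a rough illustration of the JSON path, the sketch below reads a nested JSON with 'jsonlite::fromJSON' and passes the result through a post-read function. The function flatten_names and the sample JSON are hypothetical; how 'post_read_function' is named and looked up in the config is not described here.

```r
library(jsonlite)

# Made-up nested JSON standing in for an input file
json_txt <- '[{"id": 1, "person": {"name": "A", "city": "X"}},
              {"id": 2, "person": {"name": "B", "city": "Y"}}]'

# Hypothetical post-read function: tidy the column names of the flattened frame
flatten_names <- function(x) {
  names(x) <- gsub(".", "_", names(x), fixed = TRUE)
  x
}

# flatten = TRUE turns the nested 'person' object into plain columns
raw <- fromJSON(json_txt, flatten = TRUE)
out <- flatten_names(raw)
```

The result is an ordinary data frame that can be written out like any CSV input.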
read_data_fread()
Default method to read the CSV using 'data.table::fread'.
pseudoDB$read_data_fread(fn, quote, sep, fill, skip = 0, encoding = NULL)
fn
Filename WITH full path (unlike '$read_data')
quote
Argument 'quote' in fread()
sep
Argument 'sep' in fread()
fill
Argument 'fill' in fread()
skip
Argument 'skip' in fread()
encoding
Either UTF-8 or Latin-1. If left blank, 'unknown' is used, which is not very reliable.
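The arguments above map onto 'data.table::fread' roughly as in the following sketch. The file contents are invented for illustration, and the call shows fread itself, not the method's internals.

```r
library(data.table)

# Made-up semicolon-separated file with one junk line before the header
tmp <- tempfile(fileext = ".csv")
writeLines(c("exported by some system",
             "id;name",
             "1;\"Jansen\"",
             "2;Pietersen"), tmp)

# skip = 1 drops the junk line; quote and sep describe the file layout
dt <- fread(tmp, sep = ";", quote = "\"", fill = TRUE,
            skip = 1, encoding = "UTF-8")
```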
write_data()
Writes an output CSV with 'data.table::fwrite'
pseudoDB$write_data(data, fn)
data
Dataframe
fn
Filename
encrypt()
Symmetrically encrypt a vector using the secret
pseudoDB$encrypt(x)
x
A character vector
decrypt()
Symmetrically decrypt an encrypted vector using the secret
pseudoDB$decrypt(x)
x
A character vector
symmetric_encrypt_columns()
Symmetric encryption for multiple columns at once
pseudoDB$symmetric_encrypt_columns(data, columns, new_names = NULL)
data
A Dataframe
columns
Vector of column names
new_names
Vector of new column names in the output dataframe (to be added in addition to the original).
make_hash()
The most basic function: makes the 9-character hash used for all pseudo-IDs.
pseudoDB$make_hash(n = 1, n_phrase = 9)
n
Number of hashes to make
n_phrase
Length of the hash (default = 9 chars)
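One way to generate such strings is sketched below. The alphabet and the use of 'sample()' are assumptions made for illustration; the actual method may draw its characters differently, and 'sample()' is not a cryptographically secure source of randomness.

```r
# Hypothetical stand-in for $make_hash: n random strings of n_phrase characters
make_hash_sketch <- function(n = 1, n_phrase = 9) {
  alphabet <- c(letters, LETTERS, 0:9)  # assumed character set
  vapply(
    seq_len(n),
    function(i) paste(sample(alphabet, n_phrase, replace = TRUE), collapse = ""),
    character(1)
  )
}

hashes <- make_hash_sketch(n = 3)
```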
anonymize_column()
Anonymize a column. This is the largest and most crucial method.
pseudoDB$anonymize_column( data, column, db_key = NULL, store_key_columns = NULL, normalise_key_columns = NULL, file = NULL )
data
Dataframe
column
Column name to hash
db_key
Key name of the column
store_key_columns
Special method; do not use.
normalise_key_columns
Add a normalized ASCII version of the column to the dataframe (special characters replaced with ASCII 'equivalents')
file
Unused argument; ignore
Replaces every value in the column of the dataframe with a 'hash', so that identical values in the data always get the same hash. Values that were hashed before are read from the SQLite database, so that the same value/hash combinations are used in every file and in every run of the process. Values not previously hashed get a new value/hash combination, which is written to the SQLite database.
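The lookup logic described above can be sketched in base R, with an in-memory data frame standing in for the SQLite table. All names here (pseudonymize, random_hash, the lookup columns) are hypothetical, and the sketch ignores hash collisions and missing values.

```r
# Hypothetical hash generator: n random strings of n_phrase characters
random_hash <- function(n, n_phrase = 9) {
  alphabet <- c(letters, LETTERS, 0:9)
  vapply(seq_len(n), function(i) {
    paste(sample(alphabet, n_phrase, replace = TRUE), collapse = "")
  }, character(1))
}

# Replace values by hashes, reusing known pairs and adding new ones
pseudonymize <- function(values, lookup) {
  unseen <- setdiff(unique(values), lookup$value)
  if (length(unseen) > 0) {
    new_rows <- data.frame(value = unseen, hash = random_hash(length(unseen)),
                           stringsAsFactors = FALSE)
    lookup <- rbind(lookup, new_rows)
  }
  list(hashed = lookup$hash[match(values, lookup$value)], lookup = lookup)
}

# The lookup table plays the role of the database between runs
lookup <- data.frame(value = character(0), hash = character(0),
                     stringsAsFactors = FALSE)
run1 <- pseudonymize(c("123", "456", "123"), lookup)
run2 <- pseudonymize(c("456", "789"), run1$lookup)
```

Because run2 starts from the lookup table returned by run1, the value "456" receives the same hash in both runs, which is the property that keeps pseudo-IDs stable across files.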
anonymize_columns()
See $anonymize_column; this is the vectorized version for multiple columns
pseudoDB$anonymize_columns(data, columns, db_keys, file, ...)
data
See $anonymize_column
columns
See $anonymize_column
db_keys
See $anonymize_column
file
See $anonymize_column
...
Further passed to $anonymize_column
read_bag_extract()
Only used in a very specific case; its use is not encouraged.
pseudoDB$read_bag_extract(path)
path
Filename
validate_address()
Only used in a very specific case. Not supported or encouraged.
pseudoDB$validate_address(data, column, columns_out, bag_path)
data
Dataframe
column
Column name
columns_out
Names of output columns
bag_path
Path to BAG file
process_files()
Run the entire process. Read files, anonymize, encrypt, write, log.
pseudoDB$process_files(files = NULL)
files
Optional vector of filenames to process, otherwise processes all in the loaded config.
date_to_year()
Converts dates to years. Specific to dd-mm-yyyy dates in the data; not configurable (and not used in any application)
pseudoDB$date_to_year(data, column)
data
Dataframe
column
Name of column
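A dd-mm-yyyy to year conversion can be sketched in base R as follows. This only illustrates the idea on a made-up vector; the method's own implementation and return type are not documented here.

```r
# Made-up dates in dd-mm-yyyy format
dates <- c("05-03-1987", "31-12-2001")

# Parse with an explicit format, then keep only the year
years <- as.integer(format(as.Date(dates, format = "%d-%m-%Y"), "%Y"))
```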
to_age_bracket()
Convert an age in years to a five-year bracket (5-10, 10-15, etc.)
pseudoDB$to_age_bracket(data, columns)
data
Dataframe
columns
Names of columns
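Five-year bracketing can be sketched as below. The boundary rule is an assumption: here an age of exactly 10 falls in '10-15', which may differ from what the method does.

```r
# Made-up ages in years
ages <- c(7, 10, 14, 15, 42)

# Round down to the nearest multiple of 5 and label the bracket
lower <- floor(ages / 5) * 5
brackets <- paste0(lower, "-", lower + 5)
```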
keep_columns()
Keep these columns
pseudoDB$keep_columns(data, columns)
data
Dataframe
columns
Names of columns
delete_columns()
Delete these columns
pseudoDB$delete_columns(data, columns)
data
Dataframe
columns
Names of columns
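Keeping and deleting columns correspond to the following base R operations, shown on a made-up data frame. Whether the methods raise an error for column names that are not present is not documented; this sketch silently ignores them.

```r
# Made-up data frame
df <- data.frame(id = 1:2, name = c("a", "b"), city = c("x", "y"))

# Keep only the listed columns (names not present are ignored)
kept <- df[, intersect(names(df), c("id", "city")), drop = FALSE]

# Delete the listed columns and keep the rest
dropped <- df[, setdiff(names(df), "name"), drop = FALSE]
```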
clone()
The objects of this class are cloneable with this method.
pseudoDB$clone(deep = FALSE)
deep
Whether to make a deep clone.