pseudoDB: Pseudo-database class

pseudoDB R Documentation

Description

Pseudo-database class

Public fields

con

Database connection

project

Project block from config

config

Config block from config

files

File names in the configuration file

datalog

Dataframe with file statistics. Filled before and during processing.

Methods

Public methods


Method new()

Create a new object of class 'pseudoDB'. On initialization, the object reads the configuration file, checks for and creates the output directories specified there, opens a connection to the SQLite database ('shinto_pseudomaker.sqlite'), and checks whether all files listed in the config exist (files that do not exist are skipped later).

Usage
pseudoDB$new(
  config_file,
  secret,
  log_to = c("file", "stdout"),
  max_n_lines = NULL
)
Arguments
config_file

Path to the YML file with settings

secret

Secret key used for (extra) symmetric encryption

log_to

Log to a file or to stdout. This pertains to the old-style logging in .log files; see also shintopseudo.csv in the file output folder(s).

max_n_lines

Max number of lines to read from the input files; used for testing only


Method create_directories()

Create the output, log and SQLite directories if they do not exist

Usage
pseudoDB$create_directories()

Method open_logfile()

Opens a log file in the log output directory

Usage
pseudoDB$open_logfile()

Method write_datalog()

Writes shintopseudo.csv in the file output directory

Usage
pseudoDB$write_datalog()

Method log()

Logs to the old-style logging file

Usage
pseudoDB$log(msg, how = c("info", "fatal", "warn"))
Arguments
msg

Logging message

how

Either info, fatal or warn


Method set_data_log()

Update a field in the datalog during processing

Usage
pseudoDB$set_data_log(file, what, value)
Arguments
file

For which file to set the datalog

what

Set which field (column)

value

Set the value


Method set_status()

Set the status in the datalog (for e.g. errors)

Usage
pseudoDB$set_status(file, status)
Arguments
file

Filename to set a status

status

Status to set


Method set_error()

Set an error in the data log for a file (and a timestamp)

Usage
pseudoDB$set_error(file, error)
Arguments
file

Filename to flag an error

error

Error code


Method read_config()

Reads the config from a .yml/.yaml file

Usage
pseudoDB$read_config(fn)
Arguments
fn

Path to the .yml/.yaml file
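
As a rough illustration of what reading such a config might look like, the sketch below parses a small YAML file with 'yaml::read_yaml'. The use of the 'yaml' package is an assumption, and so is every key inside the blocks; only the top-level block names (project, config, files) are suggested by the public fields of the class.

```r
library(yaml)

# Hypothetical config: the keys inside each block are invented for this sketch.
config_text <- '
project:
  name: demo
config:
  database_path: db
files:
  clients.csv:
    sep: ";"
'

config_path <- tempfile(fileext = ".yml")
writeLines(config_text, config_path)

# Parse the YAML file into a nested list
cfg <- yaml::read_yaml(config_path)

# File names are taken here from the names of the 'files' block (assumption)
file_names <- names(cfg$files)
print(file_names)
```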


Method check_files_exist()

Check if all files mentioned in the config exist

Usage
pseudoDB$check_files_exist()

Method open_sqlite()

Opens a connection to the SQLite database with 'DBI::dbConnect(RSQLite::SQLite()...)' and prepares an empty 'datadienst' table in the database if it does not already exist.

Usage
pseudoDB$open_sqlite()

Method vacuum_sqlite()

Performs a vacuum on the SQLite database. This is done automatically before the connection is closed.

Usage
pseudoDB$vacuum_sqlite()
Details

From sqlite.org: "The VACUUM command rebuilds the database file, repacking it into a minimal amount of disk space [...] Frequent inserts, updates, and deletes can cause the database file to become fragmented - where data for a single table or index is scattered around the database file. Running VACUUM ensures that each table and index is largely stored contiguously within the database file."
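
A minimal sketch of a vacuum on a throwaway SQLite file, assuming the 'DBI' and 'RSQLite' packages are installed. The table name 'demo' and its contents are hypothetical, and the sketch issues the VACUUM statement directly rather than calling the class method.

```r
library(DBI)

# Throwaway database file for the demonstration
db_path <- tempfile(fileext = ".sqlite")
con <- DBI::dbConnect(RSQLite::SQLite(), db_path)

# Fill a hypothetical table, then delete most rows to leave free pages behind
DBI::dbWriteTable(con, "demo", data.frame(id = 1:10000, value = "x"))
DBI::dbExecute(con, "DELETE FROM demo WHERE id > 10")
size_before <- file.size(db_path)

# Rebuild the database file so that unused space is released
DBI::dbExecute(con, "VACUUM")
size_after <- file.size(db_path)

n_left <- DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM demo")$n
DBI::dbDisconnect(con)
```

Comparing 'size_before' and 'size_after' shows how much space the vacuum reclaimed for this file.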


Method close_sqlite()

Close the database connection, optionally performing a vacuum first

Usage
pseudoDB$close_sqlite(vacuum = TRUE)
Arguments
vacuum

Whether to vacuum the SQLite database before closing. See the $vacuum_sqlite method.


Method close()

Close everything (also the log file)

Usage
pseudoDB$close()

Method read_data()

Reads a file listed in the config. Supports multiple read methods.

Usage
pseudoDB$read_data(fn)
Arguments
fn

Bare filename to read (full path is read from config).

Details

Normally $read_data_fread is used. If readmethod='json', the file is instead read with 'jsonlite::fromJSON' and the config setting 'post_read_function' is applied to the result, so that you can attempt to flatten the JSON into a neat CSV.
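
The JSON route described above might be pictured as follows. This is only a sketch: how 'post_read_function' is specified in the config is not documented here, so 'post_read' below is a hypothetical stand-in, and the JSON content is invented.

```r
library(jsonlite)

# Invented nested JSON input
json_text <- '[
  {"id": 1, "person": {"name": "A", "city": "X"}},
  {"id": 2, "person": {"name": "B", "city": "Y"}}
]'

# Hypothetical post-read function: flatten nested data frames into columns
post_read <- function(x) jsonlite::flatten(x)

# fromJSON yields a data frame whose 'person' column is itself a data frame
raw <- jsonlite::fromJSON(json_text)

# After the post-read step the result is a flat, CSV-like table
flat <- post_read(raw)
print(names(flat))
```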


Method read_data_fread()

Default method to read the CSV using 'data.table::fread'.

Usage
pseudoDB$read_data_fread(fn, quote, sep, fill, skip = 0, encoding = NULL)
Arguments
fn

Filename WITH full path (unlike '$read_data')

quote

Argument 'quote' in fread()

sep

Argument 'sep' in fread()

fill

Argument 'fill' in fread()

skip

Argument 'skip' in fread()

encoding

Either UTF-8 or Latin-1 (or leave blank for 'unknown', which is not very reliable!)
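
For orientation, the arguments above map onto 'data.table::fread' roughly as in the sketch below. The wrapper 'read_csv_sketch', its default values and the sample file are assumptions, not the method's actual body.

```r
library(data.table)

# Hypothetical wrapper that passes the documented arguments through to fread()
read_csv_sketch <- function(fn, quote = "\"", sep = ";", fill = FALSE,
                            skip = 0, encoding = NULL) {
  # fread() expects "unknown", "UTF-8" or "Latin-1"; map a blank value to "unknown"
  if (is.null(encoding)) encoding <- "unknown"
  data.table::fread(fn, quote = quote, sep = sep, fill = fill,
                    skip = skip, encoding = encoding)
}

# Small semicolon-separated sample file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id;name", "1;Anna", "2;Bob"), csv_path)

dt <- read_csv_sketch(csv_path, encoding = "UTF-8")
print(dt)
```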


Method write_data()

Writes an output CSV with 'data.table::fwrite'

Usage
pseudoDB$write_data(data, fn)
Arguments
data

Dataframe

fn

Filename


Method encrypt()

Symmetrically encrypt a vector using the secret

Usage
pseudoDB$encrypt(x)
Arguments
x

A character vector


Method decrypt()

Symmetrically decrypt an encrypted vector using the secret

Usage
pseudoDB$decrypt(x)
Arguments
x

A character vector


Method symmetric_encrypt_columns()

Symmetric encryption for multiple columns at once

Usage
pseudoDB$symmetric_encrypt_columns(data, columns, new_names = NULL)
Arguments
data

A Dataframe

columns

Vector of column names

new_names

Vector of new column names in the output dataframe (to be added in addition to the original).


Method make_hash()

The most basic function: makes the hashes (9 characters by default) that are used to make all pseudo-IDs.

Usage
pseudoDB$make_hash(n = 1, n_phrase = 9)
Arguments
n

Number of hashes to make

n_phrase

Length of the hash (default = 9 chars)
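
One way to picture such a function is the base R sketch below, which draws random alphanumeric strings. The alphabet and the sampling approach are assumptions; the actual generator used by the class is not described here.

```r
# Hypothetical stand-in for $make_hash: n random strings of n_phrase characters
make_hash_sketch <- function(n = 1, n_phrase = 9) {
  # Assumed alphabet: lower case, upper case and digits
  alphabet <- c(letters, LETTERS, 0:9)
  vapply(
    seq_len(n),
    function(i) paste(sample(alphabet, n_phrase, replace = TRUE), collapse = ""),
    character(1)
  )
}

hashes <- make_hash_sketch(n = 3)
print(hashes)
```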


Method anonymize_column()

Anonymize a column. This is the largest and most crucial method.

Usage
pseudoDB$anonymize_column(
  data,
  column,
  db_key = NULL,
  store_key_columns = NULL,
  normalise_key_columns = NULL,
  file = NULL
)
Arguments
data

Dataframe

column

Column name to hash

db_key

Key name of the column

store_key_columns

Special method; do not use.

normalise_key_columns

Add a normalized ASCII version of the column to the dataframe (special characters replaced with ASCII 'equivalents')

file

Unused argument; ignore

Details

Replaces every value in the column of the dataframe with a 'hash', so that identical values in the data always get the same hash. Values that were hashed before are read from the SQLite database, so that the same value/hash combinations are used in every file and in every run of the process. Values not hashed before get a new value/hash combination, which is written to the SQLite database.
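
The lookup-or-create behaviour described above can be sketched in base R with an in-memory data frame standing in for the SQLite table. All names here ('anonymize_sketch', 'random_hash', the 'value' and 'hash' columns) are hypothetical, and the real method does considerably more.

```r
# Hypothetical hash generator (see $make_hash for the real one)
random_hash <- function(n_phrase = 9) {
  paste(sample(c(letters, LETTERS, 0:9), n_phrase, replace = TRUE), collapse = "")
}

# 'lookup' plays the role of the value/hash table kept in the SQLite database
anonymize_sketch <- function(values,
                             lookup = data.frame(value = character(0),
                                                 hash = character(0),
                                                 stringsAsFactors = FALSE)) {
  # Only values without an existing hash get a new one
  new_values <- setdiff(unique(values), lookup$value)
  new_hashes <- character(0)
  while (length(new_hashes) < length(new_values)) {
    candidate <- random_hash()
    # Skip the (unlikely) case of a hash that is already in use
    if (!candidate %in% c(lookup$hash, new_hashes)) {
      new_hashes <- c(new_hashes, candidate)
    }
  }
  if (length(new_values) > 0) {
    lookup <- rbind(lookup, data.frame(value = new_values, hash = new_hashes,
                                       stringsAsFactors = FALSE))
  }
  list(hashed = lookup$hash[match(values, lookup$value)], lookup = lookup)
}

# First "run": all values are new
run1 <- anonymize_sketch(c("alice", "bob", "alice"))
# Second "run": "bob" reuses its earlier hash, "carol" gets a new one
run2 <- anonymize_sketch(c("bob", "carol"), run1$lookup)
```

Passing the lookup from one call to the next mimics the persistence that the database provides between files and runs.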


Method anonymize_columns()

See $anonymize_column; this is the vectorized version for multiple columns

Usage
pseudoDB$anonymize_columns(data, columns, db_keys, file, ...)
Arguments
data

See $anonymize_column

columns

See $anonymize_column

db_keys

See $anonymize_column

file

See $anonymize_column

...

Further passed to $anonymize_column


Method read_bag_extract()

Only used in a very specific case; its use is not encouraged.

Usage
pseudoDB$read_bag_extract(path)
Arguments
path

Filename


Method validate_address()

Only used in a very specific case. Not supported or encouraged.

Usage
pseudoDB$validate_address(data, column, columns_out, bag_path)
Arguments
data

Dataframe

column

Column name

columns_out

Names of output columns

bag_path

Path to BAG file


Method process_files()

Run the entire process. Read files, anonymize, encrypt, write, log.

Usage
pseudoDB$process_files(files = NULL)
Arguments
files

Optional vector of filenames to process, otherwise processes all in the loaded config.
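
A typical end-to-end call might look like the sketch below. It assumes that 'pseudoDB' is exported from the 'shintopseudo' package; the wrapper 'run_pseudo' is hypothetical and simply does nothing when the package or the config file is unavailable.

```r
# Hypothetical wrapper around the documented workflow: new -> process_files -> close
run_pseudo <- function(config_file, secret, files = NULL) {
  if (!requireNamespace("shintopseudo", quietly = TRUE) || !file.exists(config_file)) {
    message("shintopseudo or the config file is not available; nothing done.")
    return(invisible(FALSE))
  }

  # Reads the config, creates directories and opens the SQLite connection
  db <- shintopseudo::pseudoDB$new(config_file = config_file, secret = secret)

  # Close everything on exit; wrapped in try() because it is not stated
  # whether $process_files already closes the object itself
  on.exit(try(db$close(), silent = TRUE), add = TRUE)

  # files = NULL processes every file in the loaded config
  db$process_files(files = files)
  invisible(TRUE)
}

# With a non-existent config file the wrapper returns FALSE without side effects
result <- run_pseudo(tempfile(fileext = ".yml"), secret = "not-a-real-secret")
```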


Method date_to_year()

Converts dates to years. Specific to dd-mm-yyyy dates in the data; not configurable (and not used in any application)

Usage
pseudoDB$date_to_year(data, column)
Arguments
data

Dataframe

column

Name of column
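
A base R sketch of the conversion for dd-mm-yyyy strings follows. Whether the real method overwrites the column or adds a new one is not documented; overwriting is an assumption made here, as is the function name.

```r
# Hypothetical stand-in: replace a dd-mm-yyyy date column by its year
date_to_year_sketch <- function(data, column) {
  parsed <- as.Date(data[[column]], format = "%d-%m-%Y")
  # Values that do not parse as dd-mm-yyyy become NA
  data[[column]] <- as.integer(format(parsed, "%Y"))
  data
}

people <- data.frame(id = 1:2,
                     birth = c("03-07-1984", "21-11-2001"),
                     stringsAsFactors = FALSE)
result <- date_to_year_sketch(people, "birth")
print(result)
```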


Method to_age_bracket()

Convert age in years to an age bracket (5-10, 10-15, etc.)

Usage
pseudoDB$to_age_bracket(data, columns)
Arguments
data

Dataframe

columns

Names of columns
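
The bracketing can be sketched with base R's 'cut'. The example brackets "5-10, 10-15" do not say on which side a boundary age falls, so the sketch assumes left-closed intervals (age 10 goes into "10-15"); the function name, the bracket width and the upper limit are also assumptions.

```r
# Hypothetical stand-in: map ages in years to 5-year brackets
to_age_bracket_sketch <- function(age, width = 5, max_age = 120) {
  breaks <- seq(0, max_age, by = width)
  # Labels such as "0-5", "5-10", "10-15"
  labels <- paste(breaks[-length(breaks)], breaks[-1], sep = "-")
  # right = FALSE makes intervals left-closed: [5, 10), [10, 15), ...
  as.character(cut(age, breaks = breaks, labels = labels, right = FALSE))
}

brackets <- to_age_bracket_sketch(c(7, 10, 14))
print(brackets)
```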


Method keep_columns()

Keep these columns

Usage
pseudoDB$keep_columns(data, columns)
Arguments
data

Dataframe

columns

Names of columns


Method delete_columns()

Delete these columns

Usage
pseudoDB$delete_columns(data, columns)
Arguments
data

Dataframe

columns

Names of columns


Method clone()

The objects of this class are cloneable with this method.

Usage
pseudoDB$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.


moturoa/shintopseudo documentation built on Nov. 21, 2023, 6:57 p.m.