vidente: A package for parsing and preprocessing SEER data.

Description


The vidente package provides two categories of functions: functions to parse SEER data and functions to preprocess it.

Parsing data

The buildSEERParser function builds parsing instructions from either the .sas file found in the downloaded folder or the dictionary (.dic) file exported from the SEER*Stat software.

The readSEER function reads SEER data from ASCII text files, either downloaded from the SEER website or exported from the SEER*Stat software, following the instructions provided in the dictionary (.dic) or .sas file.
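To illustrate the general idea of parsing fixed-width ASCII records from a layout, here is a minimal base R sketch. It does not use vidente and is not its implementation: the layout table, its field names and positions, the sample records, and the parse_fixed_width helper are all hypothetical and do not correspond to a real SEER dictionary.

```r
# Hypothetical layout: field names, start columns, and widths are made up
# for illustration only.
layout <- data.frame(
  name  = c("patient_id", "sex", "age"),
  start = c(1, 9, 10),
  width = c(8, 1, 3),
  stringsAsFactors = FALSE
)

# Two fixed-width records standing in for lines of an ASCII data file.
records <- c("000000011045", "000000022067")

# Cut every record at the positions given by the layout and return
# one character column per field.
parse_fixed_width <- function(lines, layout) {
  fields <- lapply(seq_len(nrow(layout)), function(i) {
    substr(lines, layout$start[i], layout$start[i] + layout$width[i] - 1)
  })
  names(fields) <- layout$name
  as.data.frame(fields, stringsAsFactors = FALSE)
}

parsed <- parse_fixed_width(records, layout)
print(parsed)
```

All fields are kept as character strings here so that leading zeros survive; converting them to numbers or factors would be a separate step.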

The listPrimarySites function lists the keywords recognized as primary site names in the terminology adopted by SEER, so that you know which primary sites you can pass as the primary_site parameter of the readSEER function.

Preprocessing data

The plotHistNA function plots a histogram of the proportion of NA values for every feature in the dataframe.
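The quantity being plotted can be sketched in base R as follows. This is only an illustration of the idea, not the code of plotHistNA; the toy data frame and the choice of ten bins are assumptions.

```r
# Toy data frame (made up for illustration).
df <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Proportion of NA values per feature: is.na() gives a logical matrix,
# and the column means of that matrix are the NA proportions.
na_proportion <- colMeans(is.na(df))
print(na_proportion)

# Histogram of those proportions; the bin width of 0.1 is an arbitrary choice.
hist(
  na_proportion,
  breaks = seq(0, 1, by = 0.1),
  main = "Proportion of NA values per feature",
  xlab = "Proportion of NA"
)
```

For this toy data the proportions are 0.5 for a, 0.25 for b, and 1 for c.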

The removeFullNAFeatures function removes features whose values are all NA, or all NA together with an additional missing-value marker such as "Blank(s)", which appears in some datasets exported from the SEER*Stat software.
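A minimal base R sketch of this kind of filtering is shown below. The drop_all_missing helper and its extra_na argument are hypothetical names chosen for illustration; they are not the signature of removeFullNAFeatures.

```r
# Toy data frame (made up for illustration).
df <- data.frame(
  id    = c(1, 2, 3),
  empty = c(NA, NA, NA),
  blank = c("Blank(s)", NA, "Blank(s)"),
  site  = c("Lung", "Blank(s)", "Breast"),
  stringsAsFactors = FALSE
)

# Drop columns in which every value is NA or one of the extra markers.
drop_all_missing <- function(data, extra_na = character(0)) {
  is_missing <- function(column) is.na(column) | column %in% extra_na
  all_missing <- vapply(
    data,
    function(column) all(is_missing(column)),
    logical(1)
  )
  data[, !all_missing, drop = FALSE]
}

# Only true NA counts as missing: "empty" is dropped, "blank" is kept.
strict <- drop_all_missing(df)

# "Blank(s)" is also treated as missing: "empty" and "blank" are dropped.
cleaned <- drop_all_missing(df, extra_na = "Blank(s)")
print(names(cleaned))
```

Note that site is kept in both cases because only some of its values are missing.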

The findSingleValueFeatures function finds features that have the same single value in every row of the dataframe. This helps you identify features that can be removed, since they only add overhead to the analysis.
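The check can be sketched in base R as follows. The find_constant_columns helper is a hypothetical name used for illustration and is not the vidente implementation.

```r
# Toy data frame (made up for illustration).
df <- data.frame(
  registry = c("A", "A", "A"),
  year     = c(2001, 2002, 2003),
  flag     = c(TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Return the names of columns that contain exactly one distinct value.
# unique() treats NA as a value, so a column mixing NA and one other
# value is not reported as constant in this sketch.
find_constant_columns <- function(data) {
  is_constant <- vapply(
    data,
    function(column) length(unique(column)) == 1,
    logical(1)
  )
  names(data)[is_constant]
}

constant <- find_constant_columns(df)
print(constant)
```

Here registry and flag are reported, while year is not because it varies across rows.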

The getNormalizedEntropy function calculates the normalized entropy of a feature by dividing its entropy by the information length (the number of unique possible values of that feature). This ratio is also called metric entropy and measures the randomness of the information.
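The calculation described above can be sketched in base R as follows. This follows the wording of the description literally (Shannon entropy divided by the number of distinct observed values); the normalized_entropy helper, the base-2 logarithm, and the exclusion of NA values are assumptions, and getNormalizedEntropy itself may differ in these details.

```r
# Shannon entropy of a vector divided by its number of distinct values.
normalized_entropy <- function(x) {
  counts <- table(x)              # NA values are excluded by table()
  counts <- counts[counts > 0]    # ignore unused factor levels
  p <- as.numeric(counts) / sum(counts)
  entropy <- -sum(p * log2(p))    # entropy in bits (base 2 is an assumption)
  entropy / length(counts)
}

# Two equally frequent values: entropy is 1 bit, divided by 2 distinct values.
balanced <- normalized_entropy(c("a", "b", "a", "b"))

# A single repeated value carries no information.
constant <- normalized_entropy(c("a", "a", "a"))

print(c(balanced = balanced, constant = constant))
```

For these inputs the sketch gives 0.5 for the balanced vector and 0 for the constant one.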

