#' The 'ADTool' package.
#'
#' @docType package
#' @name adt_package
#' @aliases adt
#'
#' @importFrom grDevices colors
#' @importFrom graphics axis box legend lines par plot points text arrows grid rect
#' @importFrom parallel detectCores
#' @importFrom readxl read_xlsx
#' @importFrom gdata read.xls
#' @importFrom gridExtra tableGrob grid.arrange
#' @import ggplot2
#' @import dplyr
#' @import tidyr
#' @import xlsx
#'
#' @importFrom stats approxfun as.formula binomial cov density ecdf glm
#' integrate optim predict quantile sd var
#'
#' @description
#'
#' @section Background:
#'
#' Sharp increases in Alzheimer's disease (AD) cases, deaths, and costs are
#' stressing the U.S. health care system and caregivers. Without effective
#' interventions, the costs and unpaid care are projected to increase 4-fold by
#' 2050 [1]. The current scientific consensus indicated that the preclinical
#' stage, when patients are still cognitively normal, is the crucial window of
#' opportunity for early interventions before irreversible brain structure
#' changes occur [2, 3]. However, because the preclinical stage is typically
#' asymptomatic, it is important to focus on AD biomarkers, quantify their
#' ordering in time and magnitude of the signal, and estimate their evolution
#' during the preclinical progression of AD.
#'
#' @section Data Sources:
#'
#' There are several major AD data sources exist which allows researchers to
#' conduct their research.
#'
#' The BIOCARD study is a longitudinal, observational study initiated in 1995,
#' and designed to identify biomarkers associated with progression from
#' cognitively normal (CN) to mild cognitive impairment (MCI) or dementia. It
#' enrolled 349 cognitively normal individuals who were primarily middle-aged
#' (mean age=57.1) at enrollment. Most participants by design had a first-degree
#' relative with dementia. Clinical assessments and cognitive testing were
#' completed annually; MRI scans, cerebrospinal fluid, and blood specimens were
#' collected approximately every 2 years. The cohort remains being followed
#' currently with an administrative gap between 2005 and 2009. As of May 31,
#' 2018, 80 participants have cognitive symptom onset. Study characteristics are
#' shown in table 1. This study provides rare and valuable information of
#' longitudinal biomarker measurements over 20 years (median follow up 13.7
#' years) and mostly before subjects have cognitive decline – the crucial
#' preclinical time window for understanding the AD biomarker cascade.
#'
#' The ADNI study is a multicenter observation study launched in 2004, to
#' collect clinical, imaging, genetic and biospecimen biomarkers from cohorts of
#' different clinical states at baseline. It could be characterized as a
#' cross-sectional study with longitudinal follow-up in that participants in
#' diagnostic groups were very different from each other at baseline. The
#' initial phase of ADNI (2004, 5 years) enrolled 200 CN elderly, 400 MCI, and
#' 200 AD. This cohort was followed and augmented in three later phases, ADNI-GO
#' (2009, 2 years), ADNI-2 (2011, 5 years) and ADNI-3 (2016, 5 years), with over
#' 1000 additional participants enrolled in existing cohorts as well as in new
#' cohorts of non-MCI but with Significant Memory Concern (SMC), early MCI, and
#' late MCI. We will only use ADNI subjects that are not demented at baseline in
#' our analysis. This study uses a similar protocol for biomarker measurements
#' as the BIOCARD study. It complements the BIOCARD data on a larger cohort and
#' a wider spread of the AD progression continuum beyond the preclinical phase.
#'
#' The NACC UDS data is a collection of data reflecting the total enrollment
#' since 2005 across 34 Alzheimer’s Disease Centers and includes subjects with a
#' range of cognitive status. The majority of these subjects are followed
#' longitudinally with assessments obtained annually. A standard protocol is
#' used to collect clinical information, neuropsychological test results, CSF
#' markers, and MRI markers when available. This protocol will use a sub-cohort
#' of UDS data that included subjects who had at least one MRI evaluation and
#' had volumetric MRI biomarkers calculated, N= 1317 as of May 2018. This cohort
#' mostly does not have CSF or PET imaging information, but provides information
#' on a large population with intensive cognitive and MRI measures.
#'
#' @section Objectives:
#'
#' In this package, we establish AD data standards and data dictionaries in this
#' package that define the formats and organization structures of the AD data
#' across multiple data sources. R Functions are provided for data analysts to
#' integrate data from multiple data sources and create their analysis dataset.
#'
#' @references
#'
NULL
#' Dictionary of Column Names
#'
#' This dictionary maps the original column names (old_col_name) to the column
#' names of generated analysis dataset. In the raw data, the same variable may
#' have different names. This dictionary could be used to uniform variable
#' names. Also, as some column names in the raw data may be complicated
#' (including space, unit, special symbols, etc.), this dictionary is used to
#' simplify the column names.
#'
#' @name dict_src_tables
#'
#' @docType data
#'
#' @usage data(dict_src_tables)
#'
#' @format A dataframe with the following variables (all variables are in string type)
#'
#' \itemize{
#'
#' \item{index}{The index for the dictionary}
#'
#' \item{adt_col_name}{The internal column names used in merging
#' datesets.}
#'
#' \item{src_col_name}{The source column names used in the analysis
#' dateset.}
#'
#' \item{src_type}{The source of the data, such as "BIOCARD."}
#'
#' \item{adt_table_code}{The code of subtable. The options are "COG"
#' (Cognitive), "DIAG" (Diagnosis), "CSF" (FINAL_CSF), "DEMO"
#' (Demographics), "HIPPO" (Hippocampus), "AMY" (Amygdala), "EC"
#' (Entorhinal), "GE" (Genetics), "LIST_A" (list of patients not enrolled),
#' "LIST_B" (list of patients impaired). }
#'
NULL
#' Dictionary of Source files Features
#'
#' This dictionary includes the basic setting of subtables, such as "file_code",
#' "file_name", "start_row", and "date_format." In the raw data, the starting
#' row and date type may vary for different subtables. This dictionary could be
#' used to store these settings. Also, since the subtables will have different
#' names (maybe with time as a suffix), this dictionary is used to simplify the
#' subtable names.
#'
#' @name dict_src_files
#'
#' @format A dataframe with the following variables
#'
#' \itemize{
#'
#' \item{index}{The index for the dictionary}
#'
#' \item{src_type}{The source of the data, such as "BIOCARD."}
#'
#' \item{adt_table_code}{The code of subtable. The options are "COG" (Cognitive),
#' "DIAG" (Diagnosis), "CSF" (FINAL_CSF), "DEMO" (Demographics), "HIPPO"
#' (Hippocampus), "AMY" (Amygdala), "EC" (Entorhinal), "GE" (Genetics),
#' "LIST_A" (list of patients not enrolled), "LIST_B" (list of patients
#' impaired). }
#'
#' \item{src_key_words}{The identifiable keywords in the subtable names.}
#'
#' \item{src_start_row}{Integers indicating the starting row when reading
#' subtables. The default value is 1.}
#'
#' \item{src_date_format}{The format of the date in subtables. The default is
#' "%Y-%m-%d."}
#'
#' }
NULL
#' Dictionary of Data
#'
#' This dictionary includes all the variables names in the analysis dataset.
#' This dictionary could be used to check the location and discription of each
#' variable.
#'
#' @name dict_data
#'
#' @format A dataframe with the following variables:
#' \itemize{
#'
#' \item{Column_Name}{The name of variables in the analysis dataset.}
#'
#' \item{Location}{The location where this variable from.}
#'
#' \item{Description}{A description introduces the data type, range, meaning,
#' and mapping (if use one hot encoding).} }
NULL
#' ADNI Dictionary data
#'
#' Dictionary for ADNI files
#'
#' @name dict_adni
#'
#' @format A dataframe with the following variables:
#' \itemize{
#' \item{file_label}{}
#' \item{file_name}{}
#' }
NULL
#' Parameters
#'
#' List of function parameters in the \code{ADTool} package
#'
#' @name parameters
#'
#' @param path Directory of the BIOCARD data files. Default value is the working
#' directory,
#'
#' @param merge_by A character string indicating the source baseline time for
#' aligning the BIOCARD data when merging multiple files. Options include
#' "diagnosis", "cognitive", "csf", "hippocampus", "amydata" and
#' "entorhinal".
#'
#' @param window An integer (unit of days) indicating the maximum acceptable gap
#' time for merging a biomarker test to base data. Default is 730 days. For
#' most data, the time windows are calculated from the baseline time (left
#' window = the midpoint between current and the previous time point; right
#' window = the midpoint between current and the next time point). If the
#' time windows are longer than the maximum acceptable length, then force to
#' select biomarkers within the maximum window length we set. For the first
#' and last baseline time, since there is no "previous" or "next" time point
#' available, use the maximum acceptable window length instead.
#'
#' @param window_overlap A logical value indicating the time window setting.
#' Default "False." If true, all time windows will set from 1/2 "window"
#' days before the baseline time to the 1/2 "window" days after baseline. In
#' this case, the time windows may overlap, which means some biomarkers may
#' be merged into multiple baseline data. If false, the windows will be
#' calculated from the baseline times. The left window is set to be the
#' midpoint between the current and the previous time point. The right
#' window is set to be the midpoint between the current and the next time
#' point. For the first and last baseline time, since there is no "previous"
#' or "next" time point available, use the maximum acceptable window length
#' (set by the parameter "window") instead.
#'
#' @param pattern A string indicating the pattern of all the data files. Default
#' is "*.xls" (should work for both .xls and .xlsx). This pattern is used to
#' read all table names from the path.
#'
#' @param src_files Updated dictionary file for source file features. See
#' \code{\link{dict_src_files}} for more details.
#'
#' @param src_tables Updated dictionary file for source table features. See
#' \code{\link{dict_src_tables}} for more details.
#'
#' @param vec_name A character string indicating the vector needed to be removed special characters.
#'
#' @param lower_case A logical value indicating the lower case setting.
#' Default "False." If true, all characters will to convert to lower cases.
#'
#' @param src_type A string indicating the source of data. Options include
#' "BIOCARD", "NACC" and "ADNI"
#'
#' @param col_name A string indicating a column name in the source data file.
#'
#' @param dict_src_tables A dictionary file for the parameters. The
#' default dictionary is appended in the package. Since the parameters in
#' tables may vary, it could be modified by using a costomized parameter
#' dictionary. The format can refer to the appended dictionary.
#' See \code{\link{dict_src_tables}} for more details.
#'
#' @param table_code Code of subtables ("Cognitive" as "COG"), which can be mapped
#' to the source data files and is used for internal data manipulation.
#' The options are "COG" (Cognitive), "DIAG" (Diagnosis), "CSF" (FINAL_CSF), "DEMO"
#' (Demographics), "HIPPO" (Hippocampus), "AMY" (Amygdala), "EC"
#' (Entorhinal), "GE" (Genetics), "LIST_A" (list of patients not enrolled),
#' "LIST_B" (list of patients impaired).
#'
#' @param file_names List the names of all subtables in the path (all tables
#' should from the same path).
#'
#' @param dict_src_files A path of data file (.csv) of the parameters dictionary. The
#' default dictionary is appended in the package. Since the parameters in
#' tables may vary, it could be modified by using a costomized parameter
#' dictionary. The format can refer to the appended dictionary.
#' See \code{\link{dict_src_files}} for more details.
#'
#' @param dat The baseline time dataset.
#'
#' @param v_date A string indicating the date name of the baseline time.
#'
#' @param v_id A string indicating the id name of baseline dateset. Default is
#' "subject_id" (may varied if using customized column names dictionary).
#'
#' @param dat_se A dataset that will be merged.
#'
#' @param dat_marker A data set including new biomarker that will be merged.
#'
#' @param m_date A string indicating the name of measure date in dat_marker.
#'
#' @param duplist A list of duplicated columns names.
#'
#' @param rst_dict A dataset that waiting to be updated.
#'
#' @param csv_dict A dataset with updated information.
#'
#' @param cur_dat A dataset indicating current loaded subtable.
#'
#' @param data A dataset. Possible choices are "dict_tbl": the dictionary for
#' the structure of tables, "dict_cil_name": the dictionary maps the
#' original column names (old_col_name) to the column names of generated
#' analysis dataset, "dict_data": This dictionary includes all the variables
#' names in the analysis dataset. This dictionary could be used to check the
#' location and description of each variable.
#'
#' @param str String to search. Default value is NULL. If NULL, this function is
#' used to check the entire dictionary. If input a string (the variable name
#' of the analysis dataset), this function is used to check the location and
#' description of the variable.
#'
#' @param csv_fname A string indicating the CSV filename to be saved (saved to
#' the default location). Default is NULL. The saved file is in .scv format.
#' If changes of default dictionaries needed, then the costomized dictionary
#' could be added here. However, the format of the costomized dictionary
#' should follow the default one.
#'
#' @param dict A string indicating the name of the dictionary. The available
#' options are: "src_files": the dictionary for table structure, "src_tables": the
#' dictionary for mapping column names, "ana_data": the dictionary for query, "adni": the dictionary for ADNI
#' data.
#'
#' @param apoecode A string indicating the APOE. The available
#' options are: 3.4, 4.4, 2.4
#'
#' @param par_apoe A list indicating the map for apoecode. Default
#' value is list(levels = c(3.4, 4.4, 2.4),labels = c(1, 2, NA))).
#'
#' @param ad_ana_data The analysis data get from adt_biocard() function.
#'
#' @param ...
#'
#' @param vars A list of strings indicating categorical variables name.
#'
#' @param x The analysis data set
#'
#' @param var A string indicating the queried variable
#'
#' @param stack_by A string indicating the way to group data. Options are "sex" and "diagnosis".
#'
#' @param distn A string indicating the interested distribution. Options are "age", "visit", and "yar"
#'
#' @param pat_sub A data set contains needed information for ploting
#'
NULL
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.