EWAS_QC: Automated Quality Control of EWAS results files
In QCEWAS: Fast and Easy Quality Control of EWAS Results Files

EWAS_QC

R Documentation

Description

The main function of the QCEWAS package. EWAS_QC accepts a single EWAS results file, runs a thorough quality check (QC), optionally applies various filters, and generates QQ, Volcano and Manhattan plots. The function EWAS_series can be used to process multiple results files sequentially.

Usage

  EWAS_QC(data,
          map,
          outputname,
          header_translations,
          threshold_outliers = c(NA, NA),
          markers_to_exclude,
          exclude_outliers = FALSE,
          exclude_X = FALSE, exclude_Y = FALSE,
          save_final_dataset = TRUE, gzip_final_dataset = TRUE,
          header_final_dataset = "standard",
          high_quality_plots = FALSE,
          return_beta = FALSE, N_return_beta = 500000L,
          ...)

Arguments

`data`	a data frame with EWAS results, or the name of a file containing the same. The table must include the columns `PROBEID`, `BETA`, `SE`, and `P_VAL`. Other columns may be included but will be ignored. If the column names differ from the above, the argument `header_translations` can be used to translate them. If a filename is entered in this argument, it will be imported via the `read.table` function. `read.table` can handle a variety of formats, including files compressed in the .gz format. `EWAS_QC` will pass any named, unknown arguments to `read.table`, so you can specify the column separator and NA string with the usual `read.table` arguments. (Note that this only applies to importing the EWAS results, and not the map or translation files.)
`map`	a data frame with chromosome and position values of the probes, or the name of a file containing the same. This argument is optional: if no map is specified, `EWAS_QC` will skip the Manhattan plot and chromosome filters. `map` must include the columns `TARGETID`, `CHR` (chromosome), and `MAPINFO` (position), using those exact names. Other columns may be included but will be ignored. If a filename is entered in this argument, it will be imported via the `read.table` function. `read.table` can handle a variety of formats, including files compressed in the .gz format.
`outputname`	a character string specifying the intended filename for the output. This includes not only the cleaned results file and the log, but also any graphs created. Do not include an extension; `EWAS_QC` adds these automatically.
`header_translations`	a translation table for the column names of the input file, or the name of a file containing the same. This argument is optional: if not specified, `EWAS_QC` assumes the default column names are used. See `translate_header` for information on the format.
`threshold_outliers`	a numeric vector of length two. This defines which effect sizes will be treated as outliers. The first value specifies the lower limit (i.e. markers with effect sizes below this value are considered outliers), the second the upper limit. The check for low or high outliers is skipped if the respective value is set to `NA`. To skip the check entirely, set this argument to `c(NA, NA)`.
`markers_to_exclude`	Either a vector or data frame containing a list of CpG IDs that need to be excluded before starting the QC (in case of a data frame only the first column will be processed), or the name of a file containing the same. This argument is optional: if not specified, no exclusions are made. Note that when a single value (a vector of length 1) is passed to this argument, `EWAS_QC` will treat it as a filename even when no such file can be found. If you want to remove a single CpG, either pass it to this argument via a file, or add a dummy value to the vector to give it length 2 (e.g. `c("cg02198983", "dummy")` ).
`exclude_outliers`	a logical value determining how outliers are treated. If `TRUE`, they are excluded from the final dataset. If `FALSE`, they are merely counted.
`exclude_X, exclude_Y`	logical values determining whether markers on the X and Y chromosomes, respectively, are excluded from the final dataset. This requires providing a map to `EWAS_QC` via the `map` argument.
`save_final_dataset`	logical determining whether the cleaned dataset will be saved.
`gzip_final_dataset`	logical determining whether the saved dataset will be compressed in the .gz format.
`header_final_dataset`	either a character vector or a table determining the header names used in the final dataset, or the name of a file containing the same. If `"original"`, the final dataset will use the same column names as the original input file. If `"standard"`, it will use the default `EWAS_QC` column names. If a table, it will be passed to `translate_header` to convert the column names; the default column names (`PROBEID`, `BETA`, `SE`, and `P_VAL`) must be in the second column, and the desired column names in the first.
`high_quality_plots`	logical. Setting this to TRUE will save the graphs as high-resolution tiff images.
`return_beta, N_return_beta`	arguments used by `EWAS_series`. These are not important for users and can be ignored. For the sake of completeness: `return_beta` is a logical value; if `TRUE`, the function return value includes a vector of effect sizes. `N_return_beta` defines the length of the vector.
`...`	arguments passed to `read.table` for importing the EWAS results file.
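
As a minimal sketch of the input format described above, the following base-R snippet builds a small results table with the four required columns, writes it as a gzipped, semicolon-separated file, and reads it back with `read.table`. The probe IDs, values, file name, separator, and NA string are all made up for illustration; `sep` and `na.strings` are examples of the named `read.table` arguments that could be supplied to `EWAS_QC` through `...`.

```r
# Hypothetical toy EWAS results with the four required columns.
ewas <- data.frame(
  PROBEID = c("cg00000001", "cg00000002", "cg00000003"),  # made-up IDs
  BETA    = c(0.012, -0.034, NA),
  SE      = c(0.005, 0.010, 0.008),
  P_VAL   = c(0.016, 0.0007, 0.52),
  stringsAsFactors = FALSE
)

# Write a gzipped, semicolon-separated file that uses "missing" as NA string.
toy_file <- file.path(tempdir(), "toy_ewas.txt.gz")  # illustrative file name
con <- gzfile(toy_file, open = "wt")
write.table(ewas, con, sep = ";", na = "missing",
            quote = FALSE, row.names = FALSE)
close(con)

# read.table reads .gz files directly; sep and na.strings are ordinary
# read.table arguments.
reread <- read.table(toy_file, header = TRUE, sep = ";",
                     na.strings = "missing", stringsAsFactors = FALSE)

# Check that the required columns are present before starting a QC.
required <- c("PROBEID", "BETA", "SE", "P_VAL")
all(required %in% colnames(reread))
```

Checking the column names up front is optional, but it shows quickly whether a `header_translations` table is needed.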

Details

QCEWAS includes a Quick-Start guide in the doc folder of the library. This guide will explain how to run a QC and how to interpret the results. The start-up message when loading QCEWAS will indicate where it can be found on your computer. In brief, the QC consists of the following 5 stages:

Checking data integrity

The values inside the EWAS results are tested for validity. If impossible p-values, effect sizes, etc. are encountered, EWAS_QC generates a warning in the R console and sets them to NA.

Filtering for outliers and sex chromosomes (optional)

Counts the number of outlying markers, as well as chromosome X and Y markers, and deletes them if specified. The markers named in markers_to_exclude are removed here as well.

Generating QC plots

Histograms of the effect size (beta) and standard error distributions are plotted.

The p-values are checked by correlating and plotting them against p-values calculated from the effect size and standard error.

A QQ plot is generated to test for over/undersignificance.

A Manhattan plot is generated to see where the signals (if any) are located.

A Volcano plot is generated to check the distribution of effect sizes vs. p-values.

Creating a QC log

The log contains notes about any problems encountered during the QC, as well as several tables describing the data.

Saving the cleaned dataset (optional)
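
The first three stages can be illustrated with a rough base-R sketch on toy data. This is not the package's internal code: the validity rules (p-values outside [0, 1], non-positive standard errors) and the use of a normal approximation for the expected p-values are assumptions made for this example, and all values are invented.

```r
# Toy data; all values are made up for illustration.
beta <- c(0.5, -1.2, 25, 0.1, -30, 0.8)
se   <- c(0.2,  0.4,  5, 0.3,   6, -0.1)               # the negative SE is invalid
p    <- c(0.0124, 0.0027, 5.7e-7, 0.7389, 1.5, 0.02)   # 1.5 is impossible

# Integrity check (assumed rules): set impossible values to NA.
p[!is.na(p) & (p < 0 | p > 1)] <- NA
se[!is.na(se) & se <= 0] <- NA

# Outlier count: effect sizes outside the thresholds.
# An NA threshold skips that side of the check.
threshold_outliers <- c(-20, 20)
low  <- !is.na(threshold_outliers[1]) & !is.na(beta) &
  beta < threshold_outliers[1]
high <- !is.na(threshold_outliers[2]) & !is.na(beta) &
  beta > threshold_outliers[2]
n_outliers <- sum(low | high)

# P-value check: expected two-sided p-values from beta / SE
# (normal approximation, an assumption), correlated with the reported ones.
p_expected <- 2 * pnorm(-abs(beta / se))
p_cor <- cor(p, p_expected, use = "complete.obs")

n_outliers
p_cor
```

A correlation well below 1 in such a check would suggest that the reported p-values do not match the reported effect sizes and standard errors, for example because of a column mix-up.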

Value

The main outputs of EWAS_QC are the cleaned results file, the log file and the QC graphs. However, the function also returns a list with 9 elements:

`data_input`	the file name of the input file, if loaded from a file. If not, this will be an empty character string.
`file`	the filename of the cleaned results file.
`QC_success`	logical, indicates whether `EWAS_QC` was able to run a full QC on the file. Note that a `TRUE` value does not mean that no problems were encountered, merely that the full QC was executed.
`lambda`	the lambda value of reported p-values in the cleaned dataset.
`p_cor`	the correlation between reported and expected (based on effect size and standard error) p-values.
`N`	a named integer vector reporting how many markers were in the original dataset, how many had missing values, how many were on chromosomes X and Y, how many were outliers, how many were removed and how many are in the final, cleaned dataset. Has no relation to the `N` argument of `EWAS_series`.
`SE_median`	a numeric value: the median of the standard errors in the cleaned dataset.
`mean_methylation`	always `NULL`: this functionality has not been implemented yet.
`effect_size`	if `return_beta` is `TRUE`, this is a numeric vector of length `N_return_beta`, containing a random selection of effect sizes from the filtered dataset. If `FALSE`, this will be `NULL`.
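
For readers unfamiliar with the `lambda` element, the sketch below computes a genomic inflation factor in the way it is commonly defined: the median of the 1-df chi-square statistics implied by the p-values, divided by the expected median under the null. Whether `EWAS_QC` uses exactly this formula is an assumption; the p-values are simulated.

```r
set.seed(1)                       # reproducible toy data
p <- runif(10000)                 # simulated p-values under the null

# Convert p-values to 1-df chi-square statistics and compare the observed
# median with the expected median under the null hypothesis.
chisq_obs <- qchisq(p, df = 1, lower.tail = FALSE)
lambda <- median(chisq_obs, na.rm = TRUE) / qchisq(0.5, df = 1)
lambda                            # close to 1 for uniformly distributed p-values
```

Values clearly above 1 indicate oversignificance (inflation), values clearly below 1 undersignificance.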

Note

The function will return a warning if it encounters p-values < 1e-300, as this is close to the smallest number that R can process correctly. Various functions in the QCEWAS package will set these values to 1e-300 to ensure proper handling.
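
The effect of this safeguard can be sketched in base R as follows. The p-values are made up, and the snippet only imitates the described behaviour; it is not the package's own code.

```r
p <- c(0.05, 1e-310, 0, 3e-320, NA)   # made-up p-values

.Machine$double.xmin                  # smallest normalised double, about 2.2e-308

# Flag p-values below 1e-300 and set them to 1e-300.
too_small <- !is.na(p) & p < 1e-300
if (any(too_small)) {
  message(sum(too_small), " p-value(s) below 1e-300 were set to 1e-300")
  p[too_small] <- 1e-300
}

# Without the adjustment, -log10(0) would be Inf and break plot axes.
-log10(p)
```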

Examples

# For use in this example, the 2 sample files in the
# extdata folder of the QCEWAS library will be copied
# to your current R working directory. Running the QC
# generates 7 new files in your working directory:
# a cleaned, post-QC dataset, a log file, and 5 graphs.
# Consult the Quick-Start guide for more information on
# how to interpret these.
## Not run: 
file.copy(from = file.path(system.file("extdata", package = "QCEWAS"),
                           "sample_map.txt.gz"),
          to = getwd(), overwrite = FALSE, recursive = FALSE)
file.copy(from = file.path(system.file("extdata", package = "QCEWAS"),
                           "sample1.txt.gz"),
          to = getwd(), overwrite = FALSE, recursive = FALSE)

QC_results <- EWAS_QC(data = "sample1.txt.gz",
                      map = "sample_map.txt.gz",
                      outputname = "sample_output",
                      threshold_outliers = c(-20, 20),
                      exclude_outliers = FALSE,
                      exclude_X = TRUE, exclude_Y = FALSE,
                      save_final_dataset = TRUE, gzip_final_dataset = FALSE)

## End(Not run)

QCEWAS documentation built on Feb. 16, 2023, 10:30 p.m.