SetFactorModel: Helper Function for Factor Model Data Preprocessing
In JustinMShea/ExpectedReturns: Reproduction of Investment Strategies presented in Antti Ilmanen's Expected Returns and Investing Amid Low Expected Returns

SetFactorModel

R Documentation

Helper Function for Factor Model Data Preprocessing

Description

In data analysis and data mining, some procedures are routinely carried out to prepare data sets for analysis. These procedures may simply run basic checks on the data, or apply preliminary transformations that modify the initial data set (data cleaning being perhaps the best known). This helper function prepares factor model data for further analysis.

Usage

SetFactorModel(data, lrhs, clean.method, clean.bounds, across.panel, ...)

Arguments

`data`	A `data.frame` containing the data on which the selected procedures are to be carried out.
`lrhs`	A character vector specifying the following `data` columns: time periods, all the independent variables, finally the dependent variable. Position matters.
`clean.method`	A character string. One of `winsor` (default) or `trunc`.
`clean.bounds`	A character vector indicating `clean.method` cutoffs. Default bounds are 0.5% and 99.5%.
`across.panel`	A boolean. Would you like to clean `data` cross-sectionally (default) or in a time-indexed fashion?
`...`	Any additional pass-through parameters. TODO: param lagged A boolean.

Value

A data.frame with values on which the selected procedures have been applied.

Cross-sectional consistency (a.k.a "balanced panel")

TODO: crucial checks on cross-section consistency

Data cleaning procedures

The function implements several data cleaning procedures. These are often needed in empirical analyses because financial data are typically subject to outliers. Common statistical analyses tend to suffer from these extreme data points, in the sense that their output may be unreliable. Several methods, mostly from the field of robust statistics, are designed to detect and alleviate the undue effects of such outliers on the phenomena being analyzed. Bali et al. (2016) illustrate techniques commonly adopted in empirical finance:

Winsorization
Truncation

These methods are summarized below to the extent of our implementation. Additional information is provided to give some background and further guidance.

Winsorization

This technique consists in setting "the values of a given variable that are above or below a certain cutoff to that cutoff". The objective is to moderate extreme values of a variable, to the extent that the phenomenon under investigation is not substantially altered. The cutoff at which winsorization should be performed depends mainly on how noisy the variable being analyzed is: noisier variables tend to be winsorized at a higher cutoff.
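
As an illustration of the idea only, the following base R sketch clamps a numeric vector to its empirical quantiles. The helper name `winsorize_sketch` and its arguments are hypothetical and are not part of this package; the default levels mirror the 0.5% and 99.5% bounds documented above, and the internal behavior of `SetFactorModel` may differ.

```r
# Hypothetical helper for illustration; NOT the package's implementation.
winsorize_sketch <- function(x, lower = 0.005, upper = 0.995) {
  # Empirical cutoffs at the requested percentiles, ignoring missing values.
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  # Values below the lower cutoff are raised to it,
  # values above the upper cutoff are lowered to it.
  pmin(pmax(x, bounds[1]), bounds[2])
}

set.seed(1)
x <- c(rnorm(998), -50, 50)   # simulated data with two artificial outliers
x_w <- winsorize_sketch(x)

length(x_w) == length(x)      # the number of observations is unchanged
range(x)                      # includes -50 and 50
range(x_w)                    # extremes pulled in to the cutoffs
```

Note that the outliers are not discarded: they are replaced by the cutoff values, so the vector keeps its original length.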

Truncation

Similar to Winsorization, except that the values of a given variable that are above or below a certain cutoff are removed altogether.
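
A minimal base R sketch of truncation follows, under the same caveats: `truncate_sketch` is a hypothetical name, not a function of this package, and the default levels are taken from the documented 0.5% and 99.5% bounds.

```r
# Hypothetical helper for illustration; NOT the package's implementation.
truncate_sketch <- function(x, lower = 0.005, upper = 0.995) {
  # Empirical cutoffs at the requested percentiles, ignoring missing values.
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  # Keep values inside the cutoffs; missing values are left in place.
  keep <- is.na(x) | (x >= bounds[1] & x <= bounds[2])
  x[keep]
}

set.seed(1)
x <- c(rnorm(998), -50, 50)   # simulated data with two artificial outliers
x_t <- truncate_sketch(x)

length(x)                     # original number of observations
length(x_t)                   # fewer observations after truncation
range(x_t)                    # the outliers are gone
```

When the variable is a column of a `data.frame`, one alternative is to set out-of-bounds values to `NA` rather than dropping them, so that rows stay aligned with the other columns.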

Winsorization/Truncation levels

Winsorization and Truncation are usually conducted symmetrically, meaning that the same level is applied at both ends of the series. However, this need not be the case. The cleaning procedures can be carried out at arbitrarily asymmetric levels, depending on how noisy the financial data being analyzed are. This is the researcher's decision.
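
To make the asymmetric case concrete, the sketch below winsorizes only the upper tail of a right-skewed simulated variable by setting the lower level to 0% and the upper level to 99%. The helper `winsorize_asym` and the chosen levels are assumptions made for illustration; they are not taken from this package.

```r
# Hypothetical helper for illustration; NOT the package's implementation.
winsorize_asym <- function(x, lower, upper) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, bounds[1]), bounds[2])
}

set.seed(1)
x <- c(rexp(999), 100)        # right-skewed data with one large outlier

# Lower level 0% leaves the left tail untouched; only the top 1% is clamped.
x_w <- winsorize_asym(x, lower = 0, upper = 0.99)

min(x_w) == min(x)            # left tail unchanged
max(x_w) < max(x)             # right tail pulled in
```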

Cross-sectional and time-indexed Winsorization/Truncation

There are two ways to perform either cleaning technique:

Cross-sectionally. Percentiles are based on all values of the given variable across the entire panel.
Time-indexed. Percentiles are computed based on each time period separately.

Which to choose depends on the type of statistical analysis to be carried out. Bali et al. (2016) suggest that:

if a single-stage analysis will be performed on the entire panel of data, the first method is most appropriate;
in two-stage analyses the second approach is usually preferable.

They also suggest that if either choice is found to substantially influence the results of the analysis, the methodology should be viewed with suspicion.
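
The difference between the two approaches can be sketched in base R on a simulated panel. Everything here is hypothetical (the helper `winsorize_sketch`, the column names `period` and `ret`, and the 5%/95% levels chosen so that the effect is visible on a small sample); it does not reproduce how `SetFactorModel` handles `across.panel`.

```r
# Hypothetical helper for illustration; NOT the package's implementation.
winsorize_sketch <- function(x, lower = 0.05, upper = 0.95) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, bounds[1]), bounds[2])
}

set.seed(1)
# Simulated panel: a calm period followed by a volatile one.
panel <- data.frame(
  period = rep(c("2020", "2021"), each = 100),
  ret    = c(rnorm(100, sd = 1), rnorm(100, sd = 5))
)

# First approach: percentiles computed once, from all values in the panel.
panel$ret_pooled <- winsorize_sketch(panel$ret)

# Second approach: percentiles computed separately within each time period.
panel$ret_by_period <- ave(panel$ret, panel$period, FUN = winsorize_sketch)

# Compare the resulting ranges per period under each approach.
tapply(panel$ret_pooled, panel$period, range)
tapply(panel$ret_by_period, panel$period, range)
```

With pooled percentiles the cutoffs tend to be driven by the more volatile period, so the calm period may be left largely untouched, whereas per-period percentiles clamp the tails of every period.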

Winsorize or truncate?

Whether to use one or the other is a difficult question to answer in general, as some outliers are "legitimate" while others may be data errors. Most empirical asset pricing researchers choose Winsorization over truncation, as it more closely resembles the robust approach to statistical analysis. Moreover, Winsorization preserves the number of observations in the panel being analyzed, which is a good reason to prefer it. It remains, however, the researcher's decision.

Author(s)

Vito Lestingi

References

Bali, T.G., Engle, R.F., and Murray, S. (2016). Empirical Asset Pricing: The Cross Section of Stock Returns. Wiley.

JustinMShea/ExpectedReturns documentation built on June 14, 2025, 4:28 p.m.
