SetFactorModel: Helper Function for Factor Model Data Preprocessing


View source: R/SetFactorModel.R


Description

In data analysis and data mining, certain procedures are routinely carried out to prepare data sets for analysis. These procedures may simply run basic checks on a data set, or apply preliminary transformations that modify the initial data (data cleaning being perhaps the best known). This helper function prepares factor model data for further analysis.


Usage

SetFactorModel(data, lrhs, clean.method, clean.bounds, across.panel, ...)



Arguments

data
A data.frame containing the data on which the selected procedures are to be carried out.

lrhs
A character vector naming the following data columns, in this order: the time periods, all the independent variables, and finally the dependent variable. Position matters.

clean.method
A character string. One of winsor (default) or trunc.

clean.bounds
A character vector indicating the clean.method cutoffs. Default bounds are 0.5% and 99.5%.

across.panel
A boolean. Whether to clean data cross-sectionally (default) or in a time-indexed fashion.

...
Any additional pass-through parameters. TODO: param lagged A boolean.
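Because the order of the column names matters, it can help to see the layout spelled out. The following is a minimal sketch, not part of the package: the column names (date, mkt.rf, smb, hml, ret) and the toy data are hypothetical and serve only to illustrate the positional convention described for lrhs.

```r
# Hypothetical column names, used only for illustration.
lrhs <- c("date", "mkt.rf", "smb", "hml", "ret")

# Position matters: the first entry is the time column, the last entry is
# the dependent variable, and everything in between is an independent variable.
time.col   <- lrhs[1]
dep.col    <- lrhs[length(lrhs)]
indep.cols <- lrhs[-c(1, length(lrhs))]

# A toy data.frame containing exactly those columns.
set.seed(1)
data <- data.frame(
  date   = rep(as.Date(c("2020-01-31", "2020-02-29")), each = 3),
  mkt.rf = rnorm(6),
  smb    = rnorm(6),
  hml    = rnorm(6),
  ret    = rnorm(6)
)

# Sanity check: every name listed in lrhs must exist in the data.
stopifnot(all(lrhs %in% names(data)))
```

A check of this kind before calling the function is a cheap way to catch a misspelled or misplaced column name.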


Value

A data.frame with values to which the selected procedures have been applied.

Cross-sectional consistency (a.k.a "balanced panel")

TODO: crucial checks on cross-section consistency

Data cleaning procedures

The function implements several data cleaning procedures. These procedures are often needed in empirical analyses because financial data are typically subject to outliers. Common statistical analyses tend to suffer from these extreme data points, in the sense that their output may become unreliable. Several methods, mostly in the realm of robust statistics, are designed to detect and alleviate the undue influence of such observations on the phenomena being analyzed. Bali et al. (2016) illustrate techniques commonly adopted in empirical finance, among them Winsorization and truncation.

These methods are summarized below to the extent that they are implemented here. Additional information is provided to give some background and further guidance.


Winsorization

This technique consists in setting "the values of a given variable that are above or below a certain cutoff to that cutoff". The objective is to moderate extreme values of a variable without substantially altering the phenomenon under investigation. The cutoff at which Winsorization should be performed depends mainly on how noisy the variable is: noisier variables tend to be winsorized at higher cutoffs.
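As a minimal sketch of the idea, and not the package's own implementation, Winsorization of a numeric vector can be written with base R alone. The function name winsorize and its probs argument are assumptions made for this example; the default levels mirror the 0.5% and 99.5% bounds mentioned in the arguments.

```r
# Winsorize a numeric vector at the given lower/upper quantile levels.
# `winsorize` and `probs` are illustrative names, not the package's API.
winsorize <- function(x, probs = c(0.005, 0.995)) {
  cutoffs <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  # Values below the lower cutoff are raised to it; values above the
  # upper cutoff are lowered to it. Missing values stay missing.
  pmin(pmax(x, cutoffs[1]), cutoffs[2])
}

x <- c(-50, 1:98, 500)                     # two obvious outliers
w <- winsorize(x, probs = c(0.05, 0.95))   # wider cutoffs for a small sample

range(x)                 # original extremes
range(w)                 # extremes now sit at the cutoffs
length(w) == length(x)   # no observation is dropped
```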


Truncation

Similar to Winsorization, except that the values of a given variable that are above or below a certain cutoff are removed altogether.
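A comparable sketch for truncation follows, again using only base R. The name truncate_obs is hypothetical. Keeping missing values untouched, rather than dropping them along with the outliers, is a choice made for this example.

```r
# Drop the observations that fall outside the given quantile cutoffs.
# `truncate_obs` is an illustrative name, not the package's API.
truncate_obs <- function(x, probs = c(0.005, 0.995)) {
  cutoffs <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  # Keep values inside the cutoffs; missing values are kept as they are.
  keep <- is.na(x) | (x >= cutoffs[1] & x <= cutoffs[2])
  x[keep]
}

x  <- c(-50, 1:98, 500)                      # two obvious outliers
tr <- truncate_obs(x, probs = c(0.05, 0.95))

length(x)    # sample size before truncation
length(tr)   # smaller: observations beyond the cutoffs are gone
range(tr)
```

In a data.frame, the same logical index would be used to drop whole rows, which is why truncation reduces the number of observations in the panel.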

Winsorization/Truncation levels

Winsorization and truncation are usually conducted symmetrically, meaning that the same level is applied at both ends of the series. However, this need not be the case. The cleaning procedures can be carried out at arbitrarily asymmetric levels, depending on how noisy the financial data being analyzed are. This is the researcher's decision.
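The difference between symmetric and asymmetric levels can be sketched as follows. The simulated right-skewed sample and the specific levels (0.5%/99.5% versus 0%/99%) are assumptions chosen for illustration only.

```r
set.seed(42)
x <- rlnorm(1000)   # right-skewed toy sample, bounded below by zero

# Symmetric levels: the same tail probability is cut on each side.
sym  <- quantile(x, probs = c(0.005, 0.995), names = FALSE)
# Asymmetric levels: leave the lower tail untouched, clip the top 1%.
asym <- quantile(x, probs = c(0, 0.99), names = FALSE)

w.sym  <- pmin(pmax(x, sym[1]),  sym[2])
w.asym <- pmin(pmax(x, asym[1]), asym[2])

sum(w.sym  != x)   # observations modified in both tails
sum(w.asym != x)   # observations modified in the upper tail only
```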

Cross-sectional and time-indexed Winsorization/Truncation

There are two ways to perform either cleaning technique: cross-sectionally or in a time-indexed fashion.

Which to choose depends on the type of statistical analysis to be carried out. Bali et al. (2016) suggest that:

They also suggest that if any of these choices is found to substantially influence the results of the analysis, the methodology should be viewed with suspicion.
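To make the distinction concrete, the sketch below contrasts cutoffs computed separately within each time period with cutoffs computed once over the pooled panel. This is one plausible reading of the two approaches and is an assumption, as are the helper winsorize, the column names, and the simulated data. It is not taken from the package source.

```r
# Illustrative helper (not the package's API): clip x at quantile cutoffs.
winsorize <- function(x, probs = c(0.005, 0.995)) {
  cutoffs <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, cutoffs[1]), cutoffs[2])
}

# Toy panel: three periods whose dispersion grows over time.
set.seed(123)
panel <- data.frame(
  period = rep(1:3, each = 200),
  x      = c(rnorm(200, sd = 1), rnorm(200, sd = 5), rnorm(200, sd = 10))
)

# (a) Cutoffs computed separately within each period.
panel$x.by.period <- ave(panel$x, panel$period, FUN = winsorize)

# (b) Cutoffs computed once, over all periods pooled together.
panel$x.pooled <- winsorize(panel$x)

# Number of modified observations per period under each approach.
by.period.counts <- tapply(panel$x.by.period != panel$x, panel$period, sum)
pooled.counts    <- tapply(panel$x.pooled    != panel$x, panel$period, sum)
by.period.counts
pooled.counts
```

With per-period cutoffs every period has its own extremes clipped, whereas pooled cutoffs tend to fall mostly on the most volatile periods. Comparing the two sets of counts is a quick way to see how much the choice matters for a given data set.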

Winsorize or truncate?

Which of the two to use is a difficult question to answer in general, as some outliers are "legitimate" while others may be data errors. Most empirical asset pricing researchers choose Winsorization over truncation because it more closely resembles the robust approach to statistical analysis. In particular, Winsorization preserves the number of observations in the panel being analyzed, which is a good reason to prefer it. It remains, however, the researcher's decision.


Author(s)

Vito Lestingi


References

Bali, T.G., Engle, R.F., and Murray, S. (2016). Empirical Asset Pricing: The Cross Section of Stock Returns. Wiley.

JustinMShea/ExpectedReturns documentation built on Sept. 27, 2020, 5:41 p.m.