calibrate: Calibrate raw data to crisp or fuzzy sets
In QCA: Qualitative Comparative Analysis

Description Usage Arguments Details Value Author(s) References Examples

This function transforms (calibrates) the raw data to either crisp or fuzzy sets values, using both the direct and the indirect methods of calibration.

1 2	calibrate(x, type = "fuzzy", method = "direct", thresholds = NA, logistic = TRUE, idm = 0.95, ecdf = FALSE, below = 1, above = 1, ...)

`x`	A numerical causal condition.
`type`	Calibration type, either `"crisp"` or `"fuzzy"`.
`method`	Calibration method, either `"direct"`, `"indirect"` or `"TFR"`.
`thresholds`	A vector of (named) thresholds.
`logistic`	Calibrate to fuzzy sets using the logistic function.
`idm`	The set inclusion degree of membership for the logistic function.
`ecdf`	Calibrate to fuzzy sets using the empirical cumulative distribution function of the raw data.
`below`	Numeric (non-negative), determines the shape below crossover.
`above`	Numeric (non-negative), determines the shape above crossover.
`...`	Additional parameters, mainly for backwards compatibility.

Calibration is a transformational process from raw numerical data (interval or ratio level of measurement) to set membership scores, based on a certain number of qualitative anchors.

When type = "crisp", the process is similar to recoding the original values to a number of categories defined by the number of thresholds. For one threshold, the calibration produces two categories (intervals): 0 if below, 1 if above. For two thresholds, the calibration produces three categories: 0 if below the first threshold, 1 if in the interval between the thresholds and 2 if above the second threshold etc.

When type = "fuzzy", calibration produces fuzzy set membership scores, using three anchors for the increasing or decreasing s-shaped distributions (including the logistic function), and six anchors for the increasing or decreasing bell-shaped distributions.

The argument thresholds can be specified either as a simple numeric vector, or as a named numeric vector. If used as a named vector, for the first category of s-shaped distributions, the names of the thresholds should be:

`"e"`	for the full set exclusion
`"c"`	for the set crossover
`"i"`	for the full set inclusion

For the second category of bell-shaped distributions, the names of the thresholds should be:

`"e1"`	for the first (left) threshold for full set exclusion
`"c1"`	for the first (left) threshold for set crossover
`"i1"`	for the first (left) threshold for full set inclusion
`"i2"`	for the second (right) threshold for full set inclusion
`"c2"`	for the second (right) threshold for set crossover
`"e2"`	for the second (right) threshold for full set exclusion

If used as a simple numerical vector, the order of the values matter.

If e < c < i, then the membership function is increasing from e to i. If i < c < e, then the membership function is decreasing from i to e.

Same for the bell-shaped distribution, if e1 < c1 < i1 ≤ i2 < c2 < e2, then the membership function is first increasing from e1 to i1, then flat between i1 and i2, and then decreasing from i2 to e2. In contrast, if i1 < c1 < e1 ≤ e2 < c2 < i1, then the membership function is first decreasing from i1 to e1, then flat between e1 and e2, and finally increasing from e2 to i2.

When logistic = TRUE (the default), the argument idm specifies the inclusion degree of membership for the logistic function. If logistic = FALSE, the function returns linear s-shaped or bell-shaped distributions (curved using the arguments below and above), unless activating the argument ecdf.

If there is no prior knowledge on the shape of the distribution, the argument ecdf asks the computer to determine the underlying distribution of the empirical, observed points, and the calibrated measures are found along that distribution.

Both logistic and ecdf arguments can be used only for s-shaped distributions (using 3 thresholds), and they are mutually exclusive.

The parameters below and above (active only when both logistic and ecdf are deactivated, establish the degree of concentration and dilation (convex or concave shape) between the threshold and crossover:

`0 < below < 1`	dilates in a concave shape below the crossover
`below = 1`	produces a linear shape (neither convex, nor concave)
`below > 1`	concentrates in a convex shape below the crossover
`0 < above < 1`	dilates in a concave shape above the crossover
`above = 1`	produces a linear shape (neither convex, nor concave)
`above > 1`	concentrates in a convex shape above the crossover

Usually, below and above have equal values, unless specific reasons exist to make them different.

For the type = "fuzzy" it is also possible to use the "indirect" method to calibrate the data, using a procedure first introduced by Ragin (2008). The indirect method assumes a vector of thresholds to cut the original data into equal intervals, then it applies a (quasi)binomial logistic regression with a fractional polynomial equation.

The results are also fuzzy between 0 and 1, but the method is entirely different: it has no anchors (specific to the direct method), and it doesn't need to specify a calibration function to calculate the scores with.

The third method applied to fuzzy calibrations is called "TFR" and calibrates categorical data (such as Likert type response scales) to fuzzy values using the Totally Fuzzy and Relative method (Chelli and Lemmi, 1995).

A numeric vector of set membership scores, either crisp (starting from 0 with increments of 1), or fuzzy numeric values between 0 and 1.

Adrian Dusa

Cheli, B.; Lemmi, A. (1995) “A 'Totally' Fuzzy and Relative Approach to the Multidimensional Analysis of Poverty”. In Economic Notes, vol.1, pp.115-134.

Dusa, A. (2019) QCA with R. A Comprehensive Resource. Springer International Publishing, DOI: 10.1007/978-3-319-75668-4.

Ragin, C. (2008) “Fuzzy Sets: Calibration Versus Measurement.” In The Oxford Handbook of Political Methodology, edited by Janet Box-Steffensmeier, Henry E. Brady, and David Collier, pp.87-121. Oxford: Oxford University Press.

# generate heights for 100 people
# with an average of 175cm and a standard deviation of 10cm
set.seed(12345)
x <- rnorm(n = 100, mean = 175, sd = 10)


cx <- calibrate(x, type = "crisp", thresholds = 175)
plot(x, cx, main="Binary crisp set using 1 threshold",
     xlab = "Raw data", ylab = "Calibrated data", yaxt="n")
axis(2, at = 0:1)


cx <- calibrate(x, type = "crisp", thresholds = c(170, 180))
plot(x, cx, main="3 value crisp set using 2 thresholds",
     xlab = "Raw data", ylab = "Calibrated data", yaxt="n")
axis(2, at = 0:2)


# calibrate to a increasing, s-shaped fuzzy-set
cx <- calibrate(x, thresholds = "e=165, c=175, i=185")
plot(x, cx, main = "Membership scores in the set of tall people", 
     xlab = "Raw data", ylab = "Calibrated data")

     
# calibrate to an decreasing, s-shaped fuzzy-set
cx <- calibrate(x, thresholds = "i=165, c=175, e=185")
plot(x, cx, main = "Membership scores in the set of short people", 
     xlab = "Raw data", ylab = "Calibrated data")


# when not using the logistic function, linear increase
cx <- calibrate(x, thresholds = "e=165, c=175, i=185", logistic = FALSE)
plot(x, cx, main = "Membership scores in the set of tall people", 
     xlab = "Raw data", ylab = "Calibrated data")


# tweaking the parameters "below" and "above" the crossover,
# at value 3.5 approximates a logistic distribution, when e=155 and i=195
cx <- calibrate(x, thresholds = "e=155, c=175, i=195", logistic = FALSE,
      below = 3.5, above = 3.5)
plot(x, cx, main = "Membership scores in the set of tall people", 
     xlab = "Raw data", ylab = "Calibrated data")


# calibrate to a bell-shaped fuzzy set
cx <- calibrate(x, thresholds = "e1=155, c1=165, i1=175, i2=175, c2=185, e2=195",
      below = 3, above = 3)
plot(x, cx, main = "Membership scores in the set of average height",
     xlab = "Raw data", ylab = "Calibrated data")


# calibrate to an inverse bell-shaped fuzzy set
cx <- calibrate(x, thresholds = "i1=155, c1=165, e1=175, e2=175, c2=185, i2=195",
      below = 3, above = 3)
plot(x, cx, main = "Membership scores in the set of non-average height",
     xlab = "Raw data", ylab = "Calibrated data")


# the default values of "below" and "above" will produce a triangular shape
cx <- calibrate(x, thresholds = "e1=155, c1=165, i1=175, i2=175, c2=185, e2=195")
plot(x, cx, main = "Membership scores in the set of average height",
     xlab = "Raw data", ylab = "Calibrated data")


# different thresholds to produce a linear trapezoidal shape
cx <- calibrate(x, thresholds = "e1=155, c1=165, i1=172, i2=179, c2=187, e2=195")
plot(x, cx, main = "Membership scores in the set of average height",
     xlab = "Raw data", ylab = "Calibrated data")


# larger values of above and below will increase membership in or out of the set
cx <- calibrate(x, thresholds = "e1=155, c1=165, i1=175, i2=175, c2=185, e2=195",
      below = 10, above = 10)
plot(x, cx, main = "Membership scores in the set of average height",
     xlab = "Raw data", ylab = "Calibrated data")


# while extremely large values will produce virtually crisp results
cx <- calibrate(x, thresholds = "e1=155, c1=165, i1=175, i2=175, c2=185, e2=195",
      below = 10000, above = 10000)
plot(x, cx, main = "Binary crisp scores in the set of average height",
     xlab = "Raw data", ylab = "Calibrated data", yaxt="n")
axis(2, at = 0:1)
abline(v = c(165, 185), col = "red", lty = 2)

# check if crisp
round(cx, 0)


# using the empirical cumulative distribution function
# require manually setting logistic to FALSE
cx <- calibrate(x, thresholds = "e=155, c=175, i=195", logistic = FALSE,
      ecdf = TRUE)
plot(x, cx, main = "Membership scores in the set of tall people", 
     xlab = "Raw data", ylab = "Calibrated data")


## the indirect method, per capita income data from Ragin (2008)
inc <- c(40110, 34400, 25200, 24920, 20060, 17090, 15320, 13680, 11720,
         11290, 10940, 9800, 7470, 4670, 4100, 4070, 3740, 3690, 3590,
         2980, 1000, 650, 450, 110)

cinc <- calibrate(inc, method = "indirect",
        thresholds = "1000, 4000, 5000, 10000, 20000")

plot(inc, cinc, main = "Membership scores in the set of high income", 
     xlab = "Raw data", ylab = "Calibrated data")
     

# calibrating categorical data
set.seed(12345)
values <- sample(1:7, 100, replace = TRUE)

TFR <- calibrate(values, method = "TFR")

table(round(TFR, 3))