# Variable Domains and Data Editing In netropy: Statistical Entropy Analysis of Network Data

```r
knitr::opts_chunk$set(
  out.width = "100%",
  cache = FALSE
)
```

Data editing should guarantee that all variables are carefully defined, with specified domains on which they are observed or measured and specified range spaces that list their possible values. A common set-up is that all variables share the same domain, for instance a set of individuals for whom several attributes are registered. If a binary relationship such as friendship or no friendship is registered for all pairs of individuals, then the domain of this relationship is the set of dyads (pairs without regard to order) of individuals.

Variables with different domains can sometimes be combined. For instance, a node (vertex) variable $X$ with value $X_u$ for node $u$ can be extended to the domain of dyads by defining it as the pair $X_{uv} = (X_u, X_v)$ for dyad $(u,v)$. Examples of such variable transformations are given in Frank & Shafie (2016).
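The extension of a node variable to dyads can be sketched in base R. This is a minimal illustration on a toy attribute vector, not the package's implementation; the vector `x` and the data frame `dyad.x` are hypothetical names introduced here.

```r
# Toy node attribute for 4 nodes (hypothetical values, not the law firm data)
x <- c(0, 1, 1, 0)

# All unordered pairs (u, v) with u < v; combn() returns one column per dyad
pairs <- combn(length(x), 2)

# Dyad variable: the pair of attribute values carried by the two nodes
dyad.x <- data.frame(
  u   = pairs[1, ],
  v   = pairs[2, ],
  x.u = x[pairs[1, ]],
  x.v = x[pairs[2, ]]
)
dyad.x
```

The resulting data frame has one row per dyad, so its number of rows equals `choose(length(x), 2)`.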

Note that it is also possible to create a new node variable from a tie variable $Y$ with values $Y_{uv}$ on dyads by aggregating the values on the dyads incident to a node $u$ into a value $Y_u$ at that node. For instance, the edge indicators in a graph can be aggregated to degrees or other centrality measures at the vertices.
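As a small sketch of this aggregation, the degree of each node is the row sum of the adjacency matrix. The matrix `adj` below is a hypothetical toy graph, not part of the law firm data.

```r
# Toy undirected graph on 4 nodes (hypothetical symmetric adjacency matrix)
adj <- matrix(c(0, 1, 1, 0,
                1, 0, 1, 0,
                1, 1, 0, 1,
                0, 0, 1, 0), nrow = 4, byrow = TRUE)

# Aggregate the edge indicators incident to each node: the node degree
deg <- rowSums(adj)
deg
```

Since every edge is counted at both of its end points, `sum(deg)` is twice the number of edges.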

Thus, variables can be observed on, and transformed between, three domains: vertices, dyads and triads. This is exemplified in the following.

```r
library('netropy')
```

To load the internal data set, extract each object and assign the correct name to it:

```r
data(lawdata)
df.att <- lawdata[[4]]
```

## Example: observed and transformed vertex variables

Observed node variables are the attributes in the data frame `df.att` with 71 observations. All variables except `years` and `age` are categorical with finite range spaces and are therefore kept in their original form. The variables `years` and `age` need to be categorized by using their cumulative distribution functions (cdfs) to create approximately equally sized categories. These cdfs can be obtained by

```r
# for years:
x <- table(df.att$years)
values <- as.numeric(names(x))
prop.x <- prop.table(x)
# accumulate the unrounded proportions, then round for display,
# so that rounding errors do not add up in the cdf
frq.years <- data.frame(value = values, freq = as.vector(x),
                        rel.freq = round(as.vector(prop.x), 2),
                        cum.rel.freq = round(cumsum(as.vector(prop.x)), 2))

# for age:
x <- table(df.att$age)
values <- as.numeric(names(x))
prop.x <- prop.table(x)
frq.age <- data.frame(value = values, freq = as.vector(x),
                      rel.freq = round(as.vector(prop.x), 2),
                      cum.rel.freq = round(cumsum(as.vector(prop.x)), 2))
```

Inspecting these cdfs shows which values to base the categories on:

```r
frq.years
frq.age
```
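As a complement to reading the tables by eye, candidate cut points can also be computed from the empirical quantiles. The following is a minimal sketch on a toy vector, assuming that three roughly equal categories are wanted; `years` here holds hypothetical values, not the law firm data.

```r
# Toy attribute (hypothetical values, not the law firm data)
years <- c(1, 2, 2, 3, 5, 6, 8, 10, 13, 15, 20, 31)

# Empirical tertiles as candidate cut points; type = 1 inverts the
# empirical cdf, so the cut points are observed values
cuts <- unname(quantile(years, probs = c(1/3, 2/3), type = 1))

# Recode into the categories 0, 1, 2 with a nested ifelse()
years.cat <- ifelse(years <= cuts[1], 0,
                    ifelse(years <= cuts[2], 1, 2))
table(years.cat)
```

With tied values at a cut point the categories will not be exactly equal in size, which is why inspecting the cdf remains useful.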

We base the categorization of these two variables on values yielding three approximately equally sized categories (i.e. each holding roughly a third of the observations) and merge all variables into a new data frame `att.var`:

```r
att.var <-
  data.frame(
    status    = df.att$status - 1,
    gender    = df.att$gender,
    office    = df.att$office - 1,
    years     = ifelse(df.att$years <= 3, 0,
                       ifelse(df.att$years <= 13, 1, 2)),
    age       = ifelse(df.att$age <= 35, 0,
                       ifelse(df.att$age <= 45, 1, 2)),
    practice  = df.att$practice,
    lawschool = df.att$lawschool - 1
  )
```

Note that, for the sake of consistency, we also recode all variables so that their outcomes start from the value 0. The variable `senior` only has unique values, so it is redundant and omitted (later we illustrate how such redundant variables can be detected using bivariate entropies).

To transform observed dyad variables into node variables, node degrees of each network (in- and out-degree for directed advice and friendship) can be computed and categorized as shown above for `years` and `age`.
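A minimal sketch of this step for a directed network is shown below. The adjacency matrix `adj` is a hypothetical toy example, and the cut points in the `ifelse()` calls are purely illustrative; for real data they would be chosen from the cdfs of the degrees.

```r
# Toy directed network on 5 nodes (hypothetical adjacency matrix;
# entry [u, v] = 1 means that u sends a tie to v)
adj <- matrix(c(0, 1, 1, 0, 0,
                0, 0, 1, 0, 0,
                1, 0, 0, 1, 1,
                0, 0, 1, 0, 0,
                0, 0, 0, 0, 0), nrow = 5, byrow = TRUE)

out.deg <- rowSums(adj)  # number of ties sent by each node
in.deg  <- colSums(adj)  # number of ties received by each node

# Categorize into 0, 1, 2 (illustrative cut points at 0 and 1)
out.cat <- ifelse(out.deg <= 0, 0, ifelse(out.deg <= 1, 1, 2))
in.cat  <- ifelse(in.deg  <= 0, 0, ifelse(in.deg  <= 1, 1, 2))

data.frame(out.deg, in.deg, out.cat, in.cat)
```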

## Example: observed and transformed dyad variables

Dyad variables are given as pairs of incident vertex variables with $\binom{71}{2}=2485$ observations (the number of rows in the data frames created in the following). The observed node attributes in the data frame `att.var` are thus turned into pairs of individual attributes. For example, `status` with binary outcomes is transformed into a dyad variable with 4 possible outcomes $(0,0), (0,1), (1,0), (1,1)$, and `office` with three categorical outcomes gives a dyad variable with 9 possible outcomes $(0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2)$. These transformations can be done by calling the function `get_dyad_var()` on each vertex variable with the argument `type = 'att'`, which specifies that the input is a vertex attribute:

```r
dyad.status    <- get_dyad_var(att.var$status, type = 'att')
```

Note that the outcomes are recoded to numerical values to avoid character objects when performing the entropy analysis (in practice though, the actual values of variables are irrelevant for the entropy analysis as we only care about frequencies of occurrence). Thus, `status` has outcomes 0-3 and `office` has 0-8.
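To illustrate the idea of recoding pairs to single numerical values, the sketch below maps each pair of categories to one integer. This is only one possible coding, written in base R on hypothetical toy values; the coding used internally by `get_dyad_var()` may order the outcomes differently, which does not matter for the entropy analysis since only frequencies are used.

```r
# Toy node attribute with k = 3 categories coded 0, 1, 2 (hypothetical values)
office <- c(0, 2, 1, 0)
k <- 3

# Attribute values at the two nodes of every dyad (u, v) with u < v
pairs <- combn(length(office), 2)
x.u <- office[pairs[1, ]]
x.v <- office[pairs[2, ]]

# Map the pair (x.u, x.v) to a single integer in 0, ..., k^2 - 1
code <- k * x.u + x.v
code
```

With `k = 3` the codes range over 0-8, and with `k = 2` over 0-3, matching the number of possible outcomes stated above.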

Similarly, dyad variables can be created based on observed ties. For the undirected edges, we use indicator variables read directly from the adjacency matrix for the dyad in question, while for the directed ones (`advice` and `friendship`) we have pairs of indicators representing sending and receiving ties with 4 possible outcomes:

```r
dyad.cwk    <- get_dyad_var(adj.cowork, type = 'tie')
```
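For directed ties, the pair of sending and receiving indicators can be read off the adjacency matrix as in the following base R sketch. The matrix `adj` is a hypothetical toy network, and the numerical coding is an assumption made for illustration rather than the package's own coding.

```r
# Toy directed network on 3 nodes (hypothetical; [u, v] = 1 means u -> v)
adj <- matrix(c(0, 1, 0,
                1, 0, 1,
                0, 0, 0), nrow = 3, byrow = TRUE)

# All dyads (u, v) with u < v
pairs <- combn(nrow(adj), 2)

# Matrix indexing with a two-column index picks one entry per dyad
sent     <- adj[cbind(pairs[1, ], pairs[2, ])]  # tie u -> v
received <- adj[cbind(pairs[2, ], pairs[1, ])]  # tie v -> u

# 0 = no tie, 1 = only v -> u, 2 = only u -> v, 3 = mutual tie
code <- 2 * sent + received
code
```

For an undirected network the two indicators coincide, so a single indicator per dyad is enough.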

All 10 dyad variables are merged into one data frame for subsequent entropy analysis:

```r
dyad.var <-
)
```

A similar function `get_triad_var()` is implemented for transforming vertex variables and different relation types into triad variables. These triad variables have $\binom{71}{3}=57155$ observations in the law data set and are given as triples of individual attributes or by the relations among the three nodes. As for dyad variables, we call the function and specify the type of variable through the `type` argument (a column vector as input when considering vertex attributes and an adjacency matrix when considering ties). For the vertex variables we thus obtain the triad variables by:

```r
triad.status    <- get_triad_var(att.var$status, type = 'att')
```

Note that binary attributes have 8 possible triadic outcomes $$(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1)$$ coded 0-7, and attributes with three possible outcomes yield triads with 27 possible outcomes coded 0-26.
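The counts and the coding of triples can be checked with a short base R sketch. It assumes that the codes follow the order in which the outcomes are listed above (first coordinate varying fastest); whether `get_triad_var()` uses exactly this order is an assumption. The vector `status` holds hypothetical toy values.

```r
# Number of triads among 71 nodes
n.triads <- choose(71, 3)

# Toy binary attribute on 4 nodes (hypothetical values)
status <- c(0, 1, 1, 0)
k <- 2

# All triples (u, v, w) with u < v < w; one column per triad
triples <- combn(length(status), 3)
x.u <- status[triples[1, ]]
x.v <- status[triples[2, ]]
x.w <- status[triples[3, ]]

# Code the triple as an integer in 0, ..., k^3 - 1,
# e.g. (1, 0, 0) -> 1, (0, 1, 0) -> 2, (1, 1, 1) -> 7
code <- x.u + k * x.v + k^2 * x.w
code
```

Replacing `k` by 3 gives codes in 0-26 for an attribute with three categories.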

The undirected ties are transformed from having binary outcomes into triad variables with 8 possible outcomes, and directed ties are transformed from having 4 possible outcomes into triad variables with 64 possible outcomes representing the possible triadic combinations of sending and receiving ties.

```r
triad.cwk    <- get_triad_var(adj.cowork, type = 'tie')
```
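For an undirected network, the 8 triadic outcomes arise from the three edge indicators within each triple of nodes. A minimal base R sketch on a hypothetical toy graph follows; the numerical coding is an illustrative assumption, not necessarily the one used by `get_triad_var()`.

```r
# Toy undirected graph on 4 nodes (hypothetical symmetric adjacency matrix)
adj <- matrix(c(0, 1, 1, 0,
                1, 0, 1, 0,
                1, 1, 0, 1,
                0, 0, 1, 0), nrow = 4, byrow = TRUE)

# All triples (u, v, w) with u < v < w
triples <- combn(nrow(adj), 3)

# The three edge indicators within each triple
e.uv <- adj[cbind(triples[1, ], triples[2, ])]
e.uw <- adj[cbind(triples[1, ], triples[3, ])]
e.vw <- adj[cbind(triples[2, ], triples[3, ])]

# Code the three indicators as an integer in 0, ..., 7
code <- e.uv + 2 * e.uw + 4 * e.vw
code
```

For directed ties each of the three dyads has 4 possible outcomes, which gives the `4^3 = 64` triadic outcomes mentioned above.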

All triad variables are then merged into one data frame for subsequent entropy analysis.

```r
triad.var <- data.frame(cbind(
)
```

## References

Frank, O., & Shafie, T. (2016). Multivariate entropy analysis of network data. Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique, 129(1), 45-63.

Nowicki, K., Shafie, T., & Frank, O. (Forthcoming 2022). Statistical Entropy Analysis of Network Data.

netropy documentation built on Feb. 2, 2022, 9:07 a.m.