read.dem: Reads the .dem-file

Description Usage Details Value

View source: R/read_dem.r

Description

The dem file contains the socio-demografic information of the panelists at a given day.

Usage

1

Details

There are about 4500 rows in one file for the roughly 4500 panelists per day. The panel member id is column 'pin'. 'pin' in fact is the combination of the household id and the individual id. Most panelmembers will occure the following days but not all, a household can drop out of the panel for different reasons temporarily or for ever. Not all Panelmember are watching TV on a given day, hence not all panelmember will appera in the viewing files.

There are about 200 columns but the number of columns is in fact growing constantly, hence the shape of the dem file is changing over time. See below.

The most importannt columns are the first ones: day, pin, weight, guest These four columns are read by default. Any additional column has to be specified by its name to be read. The exact name of each column can be retrieved from the 'name' column in the 'file_dem' table.

The 'dem' file, like all Kantar Media TV rawdata is stored as flat text fixed with files. The table 'file_dem' gives the details on how the file has to be read. The .dem file is read by using 'read_fwf' from the 'readr' package. read_fwf() apparently is very robust. It still reads the file correctly even if the last fields are missing. It returns the column as NA with a warning. So it is ok to request a variable (say, "upc-abonnement") that does not exist in the dem file on that particular day. When combining all requested daily files into one data.table dem using rbindlist(..., fill = TRUE), that column with the fully or patially missing data for some days is represented by NAs. Therefore it is not necessary to know the number of fields for a particular day.

Nevertheless, a function returning the number of fields would look like this: nvar <- function(x) # use like: n <- nvar(date) id.n <- c( '2013-01-01'=107, '2014-02-27'=108, '2015-02-05'=157, '2015-07-01'=158, '2016-01-01'=159, '2016-02-02'=181, '2016-07-01'=182, '2016-12-05'=186 ) id.n[findInterval(as.Date(x), as.Date(names(id.n)))]

Other than that, there should never be NAs in dem and it would possibly mean a fatal error in the data production by Kantar. Such incindets have occured in the past. An NAs is a empty character field "" in the rawdata. It is not trivial to verify in Excel for example that a NA is truly an emply field as the can be non-printable ascii character at least for 'age'. However ESXCEL::=ISTLEER() proofs there are indeed missing values, e.g. there is NA in age found during April 2013!

age day N 1: NA 2013-04-08 1 2: NA 2013-04-10 1 3: NA 2013-04-13 1 4: NA 2013-04-20 1 5: NA 2013-04-23 1 6: NA 2013-04-24 1 7: NA 2013-04-27 1

All columns but the first three are stored by a field width of 1. Using a the extended ascii character set a length one filed can represent a variable with more than 100 levels, such as 'kanton' or 'age'. All columns get converted to integer. Exceptions are day (character), weigth (numeric), guest (logical).

Every column has labels for its corresponding values, like factor levels. The labels are stored in id$lab and each column in dem has its corresponding element of the same name.

Value

A data.table. All column are of type integer but day (chr), weight (num) and guest (log)


rluech/tv-clone documentation built on Jan. 7, 2022, 12:27 a.m.