setup: Read Kantar TV-Raw-Data


View source: R/setup.r

Description

This function is the interface and starting point for reading Kantar TV-Raw-Data (sometimes called PIN-Data). It collects the parameters that control how the raw data is imported and later analyzed, much like the interface of Instar Analytics. All parameters, together with further information about the TV raw data, are returned as a list named id. This list is the input to most functions of the tv package.

Usage

setup(day = "2013-01-01", to = 1, obs = c("ind", "hh")[1],
  hh.calc = c("normal", "by channel")[1], guest = TRUE, dem = TRUE,
  dem.var = NULL, dem.day = NULL, dem.uni = TRUE, dem.join = FALSE,
  view = TRUE, act = c("live", "tsv", "recordedview", "teletext")[1:2],
  plt = c("terA", "terD", "satA", "satD", "cabA", "cabD", "iptv", "web",
  "unknown"), tsv.cat = c("tsv", "overnight", "none")[3],
  tsv.ref = FALSE, ttv = TRUE, tmb = list(wholeday = c(start =
  "02:00:00", end = "25:59:59")), prg = TRUE, prg.seq = c("gross",
  "net")[2], prg.join = FALSE, path = NULL, import = FALSE)

Arguments

day

A character vector. Can also be of class 'Date' (POSIX). Specifies the days to import. The only accepted format is the ISO standard format, e.g.: '2013-12-31'. There are three ways to specify the days to import, see Details and Examples. Also see subfunction id.day.

to

Either a (single) numeric or character value. Default is 1. If day is of length one and to is numeric, then to is taken as the desired length of the sequence. If the value is negative, the sequence runs into the past from the reference date in day. If day is of length one and to is a single date value, to is the desired end date of the sequence. If day is not of length one, to is ignored.
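
The three cases above can be sketched in base R as follows. This is only an illustration, not the package's own code: the helper name expand_days is hypothetical, and it assumes that the reference day itself counts as the first element of the sequence.

```r
# Hypothetical helper (not part of the tv package) that mimics how
# 'day' and 'to' might be combined into a vector of dates.
expand_days <- function(day, to = 1) {
  day <- as.Date(day)
  if (length(day) != 1L) return(day)      # several days given: 'to' is ignored
  if (is.numeric(to)) {
    stopifnot(to != 0)
    step <- if (to < 0) -1 else 1         # negative 'to' runs into the past
    return(seq(day, by = step, length.out = abs(to)))
  }
  seq(day, as.Date(to), by = "day")       # 'to' given as an end date
}

expand_days("2018-01-01", 10)             # 10 days, starting at the reference day
expand_days("2018-01-01", -3)             # 3 days, running backwards
expand_days("2018-01-01", "2018-01-31")   # start and end date
```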

obs

Either 'ind' (default) or 'hh'. The level of observation: the data is either on the level of individuals or aggregated to household level. If 'hh', only the "housewives" of each household are returned in dem, and the viewing statements in view are aggregated within each household by means of the subfunction household. 'hh' is still experimental and does not match Instar Analytics in all cases.

hh.calc

Either 'normal' (default) or 'by channel'. The type of household aggregation. When calculating TV Total, Instar Analytics uses 'normal'. When calculating facts by channel, the choice is 'by channel'. The former is known to return results that match Instar exactly, the latter is not. For more details see ?household.

guest

TRUE (default) or FALSE. Only used if obs = 'hh', ignored otherwise. The household aggregation algorithm needs to know whether guests should be excluded or not. To calculate facts without guests, overwrite dem <- dem[!(guest)] after importing, then re-calculate sample and universe with calc.uni(dem).

dem

TRUE (default) or FALSE. Read demographics? If dem is not needed, setting dem = FALSE speeds up the whole reading process.

dem.var

Character vector specifying the additional variables of the demographics file to be imported. By default dem contains the columns day, pin, weight, guest, hw, but the Kantar raw files (.dem) contain about 240 variables. Reading a small subset is much faster and gives a better overview. If other variables are of interest, they have to be specified here by their correct names. The names of all variables in the raw file are found in id$file$dem$name, and more information in id$file$dem. The names are not the same as in Instar Analytics but shorter and better suited for interactive programming.

dem.day

For the dem file, a date different from those specified in day can be supplied. This is useful, for example, if the panel is fixed to a specific sample day and its viewing in a period before and after that day is evaluated. This procedure is known from BARB. The advantage is that each person has one single weight and all weights together are congruent with the population. The same can of course be achieved by filtering dem after importing, but then unnecessary dem files are read.

dem.uni

TRUE (default) or FALSE. Should sample and universe be calculated? See subfunction calc.uni. All demographic variables specified in dem.var above will be used as target groups. This is not always intended. It is recommended to calculate sample and universe after importing, specifying the target groups explicitly, e.g.: calc.uni(dem, target = c('sg','ageclass')). See examples.

dem.join

TRUE or FALSE (default). Should dem and view be joined? This means: view <- view[dem, on = c('day','pin')]. Only use this if you know the import is exactly as you intend it to be and you're sure to apply calc(view) directly after import. Usually it makes more sense to join dem and view after importing, and first make sure sample and universe are correct or filter guests, etc.

view

TRUE (default) or FALSE. Read viewing? If view is not needed, setting view = FALSE speeds up the whole reading process.

act

A character vector. As in Instar Analytics, possible values are live, tsv, recordedview, teletext. Default is c('live','tsv'), which stands for live and time-shifted viewing; together they yield the default currency Total-TV as returned by Instar Analytics. recordedview is not part of Total-TV but a different representation of tsv. teletext is not part of the currency.

plt

A character vector. Possible values are found in id$lab$plt$name. Default is to use all. plt stands for platform and is also a column in view. Dropping any items from the list filters view and returns only the viewing statements (rows) recorded on the remaining platforms. Of course, plt in view can also be filtered after importing, but to yield the same household aggregation as Instar Analytics, the filtering has to be applied before the household aggregation algorithm.

tsv.cat

A character vector. Possible values are tsv, overnight, none (the last one is the default). tsv.cat stands for time-shifted categories, which are also found in Instar Analytics. tsv labels each viewing statement tsv0, tsv1, ..., tsv7 according to the time passed since the live broadcast, in 24-hour steps. overnight returns the labels overnight0, overnight1, ..., overnight7 according to the number of calendar days passed since the live broadcast. none does nothing: time-shifted viewing (the .swd raw data file) is simply read and returned together with live viewing (the .swo raw data file) as a single data.table view. The two types of viewing can be identified by the column act.
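
The difference between the two labelling schemes can be illustrated with a small base R sketch. The timestamps are made up, and the arithmetic below is an assumption about how the categories are derived (24-hour steps versus calendar days); the actual computation in read.tsv may differ, for example in how the broadcast day boundary is handled.

```r
# Made-up example: one live broadcast and three time-shifted viewings of it.
live   <- as.POSIXct("2018-01-01 20:00:00", tz = "UTC")
viewed <- as.POSIXct(c("2018-01-01 23:30:00",
                       "2018-01-02 19:00:00",
                       "2018-01-02 21:00:00"), tz = "UTC")

# 'tsv': elapsed time since the live broadcast, in completed 24-hour steps
hours_past <- as.numeric(difftime(viewed, live, units = "hours"))
tsv_label  <- paste0("tsv", floor(hours_past / 24))

# 'overnight': number of calendar days between live broadcast and viewing
days_past       <- as.integer(as.Date(viewed, tz = "UTC") - as.Date(live, tz = "UTC"))
overnight_label <- paste0("overnight", days_past)

data.frame(viewed, tsv_label, overnight_label)
```

The second viewing takes place 23 hours after the broadcast, so under these assumptions it is still within the first 24-hour step but already on the next calendar day.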

tsv.ref

TRUE or FALSE (default). Should the two columns daytsv and starttsv of the time-shifted raw data file be returned in view? These two columns reflect the day and time on which the tsv viewing statement was watched. The tsv categories are calculated from this information in read.tsv, but afterwards the columns are deleted by default. Ignored if tsv.cat = 'none'.

ttv

TRUE (default) or FALSE. ttv stands for Total-TV. Should viewing be filtered by channels that belong to the Total-TV? This is also the default in Instar Analytics. If FALSE the viewing is not filtered, hence contains also viewing statements on channels that are excluded from the standard currency.

tmb

A list of time bands of the form: list('wholeday' = c(start='02:00:00', end='25:59:59')). Multiple time bands can be specified; each list element represents one time band. Its start and end times are given as a character vector of length two. The expected time format is "hh:mm:ss". Note that end is inclusive, i.e. one second before the next band starts (hence '25:59:59' rather than '26:00:00'). If the list element is named, this name is used as the label in the column tmb, otherwise a name is created automatically from its start and end times. Time bands must not overlap in time, otherwise an error is thrown. If the time bands do not cover the whole 24 hours, the time bands in between are produced automatically, resulting in a list of time bands that always covers each second of the 24 hours. This guarantees that the sum of viewing is the same whether time bands were specified or not. The time bands of interest can be subsetted afterwards. Specifying time bands means that the viewing statements in view are matched against all time bands (called an overlap join, see ?foverlaps) and the overlapping viewing statements are cropped to the overlapping time interval. Specifying time bands therefore always results in more viewing statements (rows) in view, but the sum of viewing remains unchanged.
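
How the gaps between user-defined time bands could be filled is sketched below in base R. The helpers to_sec and fill_bands are hypothetical and not part of the tv package; the sketch assumes a broadcast day running from 02:00:00 to 25:59:59 (as in the default) and inclusive end times.

```r
# Convert "hh:mm:ss" strings to seconds (hours above 24 are allowed).
to_sec <- function(x) {
  parts <- lapply(strsplit(x, ":", fixed = TRUE), as.numeric)
  vapply(parts, function(v) v[1] * 3600 + v[2] * 60 + v[3], numeric(1))
}

# Hypothetical helper: check for overlaps and add the missing bands so that
# every second of the broadcast day belongs to exactly one band.
fill_bands <- function(tmb, day_start = "02:00:00", day_end = "25:59:59") {
  start <- vapply(tmb, function(b) to_sec(b[["start"]]), numeric(1))
  end   <- vapply(tmb, function(b) to_sec(b[["end"]]),   numeric(1))
  o <- order(start)
  start <- start[o]
  end   <- end[o]
  if (any(start[-1] <= end[-length(end)])) stop("time bands overlap")
  # candidate gaps: before the first band, between bands, after the last band
  gap_start <- c(to_sec(day_start), end + 1)
  gap_end   <- c(start - 1, to_sec(day_end))
  keep <- gap_start <= gap_end            # drop empty gaps
  out <- data.frame(start = c(start, gap_start[keep]),
                    end   = c(end,   gap_end[keep]))
  out[order(out$start), ]
}

bands <- fill_bands(list(prime = c(start = "18:00:00", end = "22:59:59")))
bands                                     # three bands, in seconds
sum(bands$end - bands$start + 1)          # 86400: the whole day is covered
```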

prg

TRUE (default) or FALSE. Read program logs? If prog is not needed, setting prg = FALSE speeds up the whole reading process. Note that programs are abbreviated prg in parameter and column names, but the data.table is named prog. This was necessary due to scoping interference in data.table when the data.table and one of its columns share the same name.

prg.seq

Either 'gross' or 'net' (default). Programs that were aired with advertisement breaks in between have multiple records (retaining the same program-ID) in the program logs. Next to the separate 'net' sequences there is an additional entry spanning the total 'gross' timerange including ad breaks. To calculate facts by programs, the standard is to use net program duration.

prg.join

TRUE or FALSE (default). Should prog and view be joined? This means: view <- overlap.join(view, prog, type='prg'), see import. Only use this if you know the import of view and prog is as intended and you are sure to apply calc(view, by = c('day','prg')) directly after import.

path

Used to specify ad hoc paths. path is a named list of paths; see path() and use the very same structure as found there.

import

TRUE or FALSE (default). For convenience, start reading the files immediately? If TRUE, you will find the four objects id, dem, view, prog in your global environment after the import is finished. If FALSE, you only get id and follow up by importing dem, view, prog by means of import(id).

Details

The default values provide the minimum of information for performance reasons, but enough to calculate standard estimates (facts). The default values are usually the same as the defaults in Instar Analytics. All parameters are optional, although there are many parameters to specify the data import.

Dates in day do not need to be continuous or in chronological order, e.g. loading weekends only is fine. The lowest possible value is '2013-01-01'. The highest possible value is yesterday's date (Sys.Date() - 1), but this depends on what is found in the /data directory in path. As in Instar Analytics, Overnight +7 is only available after 8 days.
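
A non-continuous vector of days, such as the weekends of one month, can be built with base R before it is passed to day. The following is a minimal sketch; the call to setup is commented out because it requires the tv package and the raw data files.

```r
# All days of January 2018
days <- seq(as.Date("2018-01-01"), as.Date("2018-01-31"), by = "day")

# "%u" gives the ISO weekday number (1 = Monday, ..., 7 = Sunday)
weekend <- days[format(days, "%u") %in% c("6", "7")]
weekend

# id <- setup(day = weekend)   # requires the tv package and raw data
```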

If some files specified to import are not found under /data, instead of breaking, the program runs with a warning listing all missing files.

Value

A list named id containing all necessary information to read the raw data. id is assigned to the global environment for convenience during interactive programming. At the same time the function returns the same list in the classical way, allowing assignment (e.g.: id <- setup()). The latter is necessary to import the data within a function call. See examples. If setup(import = TRUE) is used, the subfunction import(id) is called and consequently the data.tables dem, view, prog are returned in addition, again to the global environment. See examples.
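
The dual behaviour described above (assigning to the global environment and also returning the value) can be sketched as follows. setup_sketch and run are hypothetical stand-ins for illustration only, not the package's implementation.

```r
# Hypothetical stand-in for setup(): builds a list, assigns it to the
# global environment as 'id', and also returns it to the caller.
setup_sketch <- function(day = "2013-01-01") {
  id <- list(day = as.Date(day))
  assign("id", id, envir = globalenv())   # convenient for interactive use
  id                                      # classical return value
}

# Inside a function, rely on the return value instead of the global object.
run <- function() {
  id <- setup_sketch("2018-01-01")
  length(id$day)
}
run()
```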

Examples

# calculate standard facts for 10 days:
library(tv)
setup('2018-01-01', 10, dem.join = TRUE, import = TRUE)
calc(view)

# the same, for non-interactive programming and more instructive:
id <- setup('2018-01-01', 10)
data <- import(id)

dem  <- data$dem
view <- data$view
prog <- data$prog

join <- view[dem, on = c('day','pin')]
calc(join, by = 'day')

# what is meant by interactive programming?
?interactive

# speed up import, if not all of the 3 datasets (dem, view, prog) are needed:
setup('2018-01-01', dem = TRUE, view = FALSE, prg = FALSE) # only dem

# import more demographic variables:
id$file$dem$name # all available variables of the dem-files
setup('2018-01-01', dem = TRUE, dem.var = c('sg','age','sex'), import = TRUE)

# create variable "ageclass" based on variable "age":
dem.add(dem, 'ageclass')

# calculate facts by target group:
calc.uni(dem, target = c('day','sg','ageclass','sex'))
join <- view[dem, on = c('day','pin')]
res <- calc(join, by = c('day','sg','ageclass','sex'))
res

# switch between values and labels:
res[, 'sg' := id$lab$sg[res, on = 'sg', label]]
res[, 'ageclass' := id$lab$ageclass[res, on = 'ageclass', label]]
res[, 'sex' := id$lab$sex[, x := c('male','female')][res, on = 'sex', x]] # customize
# switch back to values:
res[, 'sg' := id$lab$sg[res, on = c(label='sg'), sg]]
res[, 'ageclass' := id$lab$ageclass[res, on = c(label='ageclass'), ageclass]]
res[, 'sex' := id$lab$sex[res, on = c(x = 'sex'), sex]]

# specify date, date range
setup('2018.12.31') # Error with message what date format is expected
setup('31-12-2018') # Error
setup('2018-12-31') # correct

# A. date as a vector of dates
setup(day = '2018-01-01', import = TRUE)
setup(day = c('2017-01-02','2018-01-02','2019-01-01'), import = TRUE)
dem[, .N, keyby = day] # number of rows per day

# B. date as a continuous sequence relative to a reference date
setup(day = '2018-01-01', to = 10, import = TRUE) # the 10 following days
setup(day = '2018-01-01', -3, import = TRUE)      # the 3 previous days

# C. date as a continuous sequence with start and end date
setup('2018-01-01', '2018-01-31', import = TRUE)

# add calendar variables "year" "month", "weekday", "weekend" etc.
dem.add(dem, 'calendar')
dem[, 'wend' := id$lab$wend[dem, on = 'wend', label]]
dem[, .N, keyby = .(day, wend)]

rluech/tv-clone documentation built on Jan. 7, 2022, 12:27 a.m.