
---
title: "Getting and cleaning data"
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

The objective of this program is to get and tidy information from the accelerometers of the Samsung Galaxy S smartphone.

All information is provided by the UCI Machine Learning Repository.

The files can be downloaded as a zip archive from the repository, where a full description of the data is also available.

## Assumptions

### About files

Although the data is split into several files, I focus only on a subset of them, and within the data, only on the columns containing means and standard deviations.

I assume nothing about the data except the structure of the zipped file and its naming conventions, that is:

1.- activity_labels.txt and features.txt are in the root of the tree
2.- There are several folders according to the requirements of the machine learning workflow: train, test, val, ...
3.- Each folder has a name (xxx) and contains three files:
    3.1 subject_xxx.txt Indicating the subject id
    3.2 y_xxx.txt Indicating the activity of the subject
    3.3 X_xxx.txt The observations

### About data

## Approach

Main features are:

1.- Make the code as abstract (generic) as possible
2.- Minimize the use of memory
3.- Minimize reads

There are two basic processes:

1.- Load the data into memory
2.- Make it tidy

### Load the data

The data is loaded directly from the Internet or from a previously downloaded file, depending on the prefix of the file name (http or not), and it is read without decompressing the archive.
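
A minimal sketch of this step, assuming `loadFile` takes a file name or URL (the exact signature is my guess). A remote URL is downloaded to a temporary file; in both cases the path to the zip is returned, to be read later through `unz()` connections without decompressing the whole archive:

```r
# Sketch only: the signature of loadFile is an assumption.
loadFile <- function(name) {
  if (grepl("^http", name)) {
    # Remote file: download the zip once to a temporary location.
    tmp <- tempfile(fileext = ".zip")
    download.file(name, tmp, mode = "wb")
    return(tmp)
  }
  # Local file: use it as-is.
  name
}
```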

Flow is:

1.- loadFile -> get access to the file
2.- Process features / select the column names to read, to avoid reading the full data
3.- For each directory (train, test, etc.)
    3.1 .- Load the subject file
    3.2 .- Load the activity (y) file
    3.3 .- Load the selected columns from the X data
4.- Mark subject and activity as factors
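
Step 2 can be sketched with `read.table`'s `colClasses` argument: columns marked `"NULL"` are skipped entirely, so the full X data is never read into memory. The feature names below are illustrative placeholders, not the real contents of features.txt:

```r
# Illustrative feature names; the real ones come from features.txt.
features <- c("tBodyAcc-mean()-X", "tBodyAcc-energy()-X", "tBodyAcc-std()-X")

# Keep only means and standard deviations.
wanted  <- grepl("mean\\(\\)|std\\(\\)", features)
classes <- ifelse(wanted, "numeric", "NULL")   # "NULL" = skip the column

# Two fake observation rows standing in for X_xxx.txt.
txt <- "0.10 0.20 0.30
0.40 0.50 0.60"
x <- read.table(text = txt, colClasses = classes,
                col.names = features, check.names = FALSE)
# Only the two mean/std columns are read.
```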

The relationship between files is:

 +--------+  1    1  +---------+  1    1  +---------+
 | X_data | <------> | subject | <------> |    Y    |
 +--------+  By row  +---------+  By row  +---------+

The resulting dataframe is:

+---------+----------+-----------------------------+
| subject | activity | vector of selected columns  |
| subject | activity | vector of selected columns  |
|  ...    |    ...   |          ....               |
+---------+----------+-----------------------------+
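
Because the relationship is 1:1 by row, the three pieces can simply be bound column-wise; a toy illustration with made-up values, including step 4 (marking factors):

```r
# Made-up values standing in for the three files of one directory.
subject  <- data.frame(subject  = c(1, 1, 2))                          # subject_xxx.txt
activity <- data.frame(activity = c("WALKING", "SITTING", "SITTING"))  # y_xxx.txt
xdata    <- data.frame(tbody_mean_x = c(0.10, 0.27, 0.33))             # selected X columns

df <- cbind(subject, activity, xdata)

# Step 4: mark subject and activity as factors.
df$subject  <- factor(df$subject)
df$activity <- factor(df$activity)
```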

### Tidying the data

Once the data is loaded into memory, it looks like this:

+---------+----------+--------------+--------------+--------------+------------+------------+------------+-------------------+
| subject | activity | tbody_mean_x | tbody_mean_y | tbody_mean_z | tbody_sd_x | tbody_sd_y | tbody_sd_z | tgravity_sd_x ... |
| subject | activity | tbody_mean_x | tbody_mean_y | tbody_mean_z | tbody_sd_x | tbody_sd_y | tbody_sd_z | tgravity_sd_x ... |
|   ...   |   ...    |                                            ...                                                        |
+---------+----------+----------------------------------------------------------------------------------------------------+

Each column has an unfriendly name and mixes different kinds of information. The target is to split the data into groups by:

* body/gravity/etc.
* mean/sd

getting a data frame like this:

1.- Subject: Id of the subject
2.- Activity: Factor of activities
3.- Object: Measured object: Body, Gravity, ...
4.- Measure: Type of data: mean or standard deviation
5.- X: X value
6.- Y: Y value
7.- Z: Z value

Flow is:

1.- For each main pattern (body, gravity, etc.) split the data
    1.1.- Set the column Object to the pattern
    1.2.- For each measure (mean, sd) split the data
        1.2.1 .- Set the column Measure to its value
        1.2.2 .- Set the column names to X, Y, Z
        1.2.3 .- Combine the columns subject/activity with the data frame
    1.3.- Combine the rows into a data frame
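
The tidying flow above can be sketched in base R; the function name, the pattern list, and the column-naming scheme are assumptions based on the wide layout shown earlier:

```r
# Sketch: names and patterns are assumptions, not the package's actual code.
tidyGroups <- function(df, patterns = c("tbody", "tgravity"),
                       measures = c("mean", "sd")) {
  out <- list()
  for (p in patterns) {                          # 1.-   split by main pattern
    for (m in measures) {                        # 1.2.- split by measure
      cols <- sprintf("%s_%s_%s", p, m, c("x", "y", "z"))
      part <- df[, cols]
      names(part) <- c("X", "Y", "Z")            # 1.2.2 set the names to X, Y, Z
      part$Object  <- p                          # 1.1.- set the Object column
      part$Measure <- m                          # 1.2.1 set the Measure column
      out[[length(out) + 1L]] <-                 # 1.2.3 attach subject/activity
        cbind(df[, c("subject", "activity")], part)
    }
  }
  do.call(rbind, out)                            # 1.3.- combine the rows
}
```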


Grandez/JGGTidy documentation built on May 6, 2019, 6:30 p.m.