---
title: "Getting and cleaning data"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
The objective of this program is to get and tidy some information from the accelerometers of the Samsung Galaxy S smartphone.
All data is provided by the UCI Machine Learning Repository.
Files can be downloaded in zip format, and all information about the data can be found there.
Although the data is split across several files, I focus on:
Within the data, only the means and standard deviations are kept.
I assume nothing about the data except the structure of the zip file and its naming conventions, namely:
1.- Activity_labels.txt and features.txt are in the root of the tree
2.- There are several folders according to the requirements of the machine learning: train, test, val, ...
3.- Each folder has a name (xxx) and contains three files:
3.1 subject_xxx.txt Indicating the subject Id
3.2 y_xxx.txt Indicating the activity of the subject
3.3 x_xxx.txt The observations
Main features are:
1.- Make the code as abstract (generic) as possible
2.- Minimize the use of memory
3.- Minimize file reads
There are two basic processes:
1.- Load the data into memory
2.- Make it tidy
Data is loaded directly from the Internet or from a previously downloaded file, depending on the prefix of the file name (http or not), and it is read without decompressing the archive.
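The loading rule above can be sketched in base R. This is only an illustration under assumptions: `is_remote`, `locate_zip` and `read_member` are hypothetical names (not the program's own functions), and a remote archive is assumed to be saved to a temporary file first, because `unz()` reads members from a local zip file.

```r
# Hypothetical helper: TRUE when the name starts with http:// or https://
is_remote <- function(path) grepl("^https?://", path)

# Hypothetical helper: return a local path to the zip archive,
# downloading it to a temporary file only when the name is a URL.
locate_zip <- function(path) {
  if (is_remote(path)) {
    tmp <- tempfile(fileext = ".zip")
    utils::download.file(path, tmp, mode = "wb", quiet = TRUE)
    return(tmp)
  }
  path
}

# Hypothetical helper: read one member of the archive as a table.
# unz() opens the member as a connection, so nothing is extracted to disk.
read_member <- function(zip_path, member, ...) {
  utils::read.table(unz(zip_path, member), ...)
}

# Usage (file and member names are placeholders):
# zip_path <- locate_zip("dataset.zip")
# features <- read_member(zip_path, "features.txt", stringsAsFactors = FALSE)
```

Keeping the "where is the archive" decision separate from the "read one member" step means the rest of the code never needs to know whether the data came from the network.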
Flow is:
1.- loadFile -> get access to file
2.- Process features / select the column names to read, to avoid reading the full data
3.- For each directory (train, test, etc)
3.1 .- Load subject file
3.2 .- Load Y file (activities)
3.3 .- Load selected columns from X data
4.- Mark subject and activity as factors
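Step 2 of the flow above (reading only the selected columns) might look like the sketch below. It rests on assumptions: the sample feature names and the inline text stand in for features.txt and the X file, `select_col_classes` is a hypothetical helper, and the selection pattern assumes the names contain `mean()` or `std()`. Columns whose class is `"NULL"` are skipped by `read.table`, so they are never kept in memory.

```r
# Sample feature names (illustrative stand-in for features.txt)
features <- c("tBodyAcc-mean()-X", "tBodyAcc-mean()-Y",
              "tBodyAcc-max()-X", "tBodyAcc-std()-X")

# Hypothetical helper: "numeric" for columns to keep, "NULL" for columns to skip
select_col_classes <- function(feature_names) {
  keep <- grepl("-(mean|std)\\(\\)", feature_names)
  ifelse(keep, "numeric", "NULL")
}

# Inline text standing in for the X observations (two rows, four columns)
x_text <- "0.1 0.2 0.9 0.3\n0.4 0.5 0.8 0.6"

x <- read.table(text = x_text,
                colClasses = select_col_classes(features),
                col.names = features,
                check.names = FALSE)
```

In the real flow the `text =` argument would be replaced by a connection to the X file inside the archive.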
The relationship between files is:
+--------+  1      1  +---------+  1      1  +---------+
| X_data | <--------> | subject | <--------> |    Y    |
+--------+   By row   +---------+   By row   +---------+
The resulting dataframe is:
+---------+----------+----------------------------+
| subject | activity | vector of selected columns |
| subject | activity | vector of selected columns |
| ...     | ...      | ...                        |
+---------+----------+----------------------------+
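Because the three files are related row by row, the data frame above can be assembled with a plain column bind. The sketch below uses small made-up data frames in place of the real files; the variable names and the two activity labels are assumptions for illustration only.

```r
# Made-up stand-ins for the subject, Y and X files (same number of rows)
subject <- data.frame(subject = c(1L, 1L, 2L))
y <- data.frame(activity = c(1L, 2L, 1L))
x <- data.frame(tbody_mean_x = c(0.1, 0.2, 0.3),
                tbody_sd_x = c(0.01, 0.02, 0.03))

# Made-up stand-in for the activity labels file (id, name)
labels <- data.frame(id = 1:2, name = c("WALKING", "SITTING"),
                     stringsAsFactors = FALSE)

# The row-by-row relationship only holds if the row counts match
stopifnot(nrow(subject) == nrow(y), nrow(y) == nrow(x))

loaded <- cbind(subject, y, x)

# Mark subject and activity as factors, using the label names for activities
loaded$subject <- factor(loaded$subject)
loaded$activity <- factor(loaded$activity,
                          levels = labels$id, labels = labels$name)
```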
Once the data are loaded into memory, they look like this:
+---------+----------+--------------+--------------+--------------+------------+------------+------------+-------------------+
| subject | activity | tbody_mean_x | tbody_mean_y | tbody_mean_z | tbody_sd_x | tbody_sd_y | tbody_sd_z | tgravity_sd_x ... |
| subject | activity | tbody_mean_x | tbody_mean_y | tbody_mean_z | tbody_sd_x | tbody_sd_y | tbody_sd_z | tgravity_sd_x ... |
| ...     | ...      | ...          | ...          | ...          | ...        | ...        | ...        | ...               |
+---------+----------+--------------+--------------+--------------+------------+------------+------------+-------------------+
Here the columns have unfriendly names and each row mixes different kinds of information. The goal is to split the data into groups by:
* body/gravity/etc
* mean/sd

The result is a data frame like this:
1.- Subject: Id of subject
2.- Activity: Factor of activities
3.- Object: Measured object: Body, Gravity, ...
4.- Measure: Type of data: mean or standard deviation
5.- X: X value
6.- Y: Y value
7.- Z: Z value
Flow is:
1.- for each main pattern (body, gravity, etc.) split the data
1.1.- Set the column object to pattern
1.2.- For each measure (mean, sd) split the data
1.2.1 .- Set the column Measure to its value
1.2.2 .- Set the names to X,Y,Z
1.2.3 .- Combine columns subject/activity with the data frame
1.3.- Combine rows in a data frame
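The tidy flow above could be sketched in base R as follows. This is a minimal illustration, not the program's actual code: `tidy_axes` is a hypothetical function, the column names are assumed to follow the `object_measure_axis` pattern shown in the loaded table, and the input data frame is made up.

```r
# Hypothetical helper: one output row per (input row, object, measure),
# with the three axis values in columns X, Y, Z.
tidy_axes <- function(df, objects, measures = c("mean", "sd")) {
  pieces <- list()
  for (obj in objects) {                       # 1.-   each main pattern
    for (m in measures) {                      # 1.2.- each measure
      cols <- paste(obj, m, c("x", "y", "z"), sep = "_")
      if (!all(cols %in% names(df))) next      # skip groups not present
      part <- df[, cols]
      names(part) <- c("X", "Y", "Z")          # 1.2.2 friendly axis names
      pieces[[length(pieces) + 1]] <- data.frame(
        Subject = df$subject, Activity = df$activity,
        Object = obj, Measure = m, part,       # 1.1 / 1.2.1 / 1.2.3
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, pieces)                       # 1.3.- combine rows
}

# Made-up input in the shape of the loaded data
loaded <- data.frame(
  subject = factor(c(1, 2)),
  activity = factor(c("WALKING", "SITTING")),
  tbody_mean_x = c(0.1, 0.2), tbody_mean_y = c(0.3, 0.4), tbody_mean_z = c(0.5, 0.6),
  tbody_sd_x = c(0.01, 0.02), tbody_sd_y = c(0.03, 0.04), tbody_sd_z = c(0.05, 0.06)
)

tidy <- tidy_axes(loaded, objects = c("tbody", "tgravity"))
```

Here `tgravity` is requested but absent from the made-up input, so only the `tbody` groups are returned.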