knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
ANLY 500 will focus on the foundations of:
The first question we must ask ourselves in this course is: What is Analytics?
The utilization of:
The focus of data analytics can be defined under three scopes, including:
knitr::include_graphics("pictures/introDA1/descriptive.png")
Description: Understand the Historical Trend of Sunspots from 1749 to 2013.
library(datasets)
data("sunspot.month") # special way to load embedded data
head(sunspot.month)
str(sunspot.month)
summary(sunspot.month)
library(ggplot2)
sunspot.month <- as.data.frame(sunspot.month)
sunspot.month$Time <- 1:nrow(sunspot.month)
ggplot(sunspot.month, aes(x = Time, y = x)) +
  geom_point(alpha = 0.5) +
  ylab("Number of Sunspots") +
  xlab("Time") +
  theme_classic()
knitr::include_graphics("pictures/introDA1/predictive.png")
library(quantmod)
start <- as.Date(Sys.Date() - (365 * 5))
end <- as.Date(Sys.Date() - 2)
getSymbols("AMZN", src = "yahoo", from = start, to = end)
str(AMZN)
predictive_model <- lm(
  formula = AMZN.Close ~ AMZN.High + AMZN.Low + AMZN.Volume,
  data = AMZN[1:1199, ]
)
summary(predictive_model)
par(mfrow = c(2, 3))
plot(predictive_model, 1)
plot(predictive_model, 2)
plot(predictive_model, 3)
plot(predictive_model, 4)
plot(predictive_model, 5)
n <- length(AMZN[, 1])
prediction <- stats::predict(predictive_model, AMZN[1200:n, ])
tail(data.frame(prediction))
plot(prediction, type = "l")
knitr::include_graphics("pictures/introDA1/prescriptive.png")
Analytics is the discovery, interpretation, and communication of meaningful patterns in, or summaries of, data.
Now we should be asking the question: What is Data Analytics?
High level analysis techniques commonly used in data analytics include:
However, two other types of analysis may be considered.
Quantitative data analysis: involves analysis of numerical data with quantifiable variables that can be compared or measured statistically.
Qualitative data analysis: is more interpretive and focuses on understanding the content of non-numerical data such as text, images, audio, and video, including common phrases, themes, and points of view.
knitr::include_graphics("pictures/introDA1/research_process.png")
knitr::include_graphics("pictures/introDA1/initial_obs.png")
In other words, formulate a question that needs to be answered.
Test the concept:
Theory:
knitr::include_graphics("pictures/introDA1/cartoon_theory.png")
Hypothesis:
Falsification:
knitr::include_graphics("pictures/introDA1/kp_false.png")
Independent Variable:
Dependent Variable:
Data is a set of values/measurements of quantitative or qualitative variables.
In a dataset, we can distinguish two types of variables:
Includes the following:
R stores categorical variables as factors or character vectors.
Definition - A binary variable has only two categories.
Definition - A nominal variable has more than two categories.
Whether someone is an Assistant, Associate, or Full Professor.
Definition - An ordinal variable is the same as a nominal variable, but the categories have a logical order.
In addition to being able to classify values into categories, you can order the categories: first, second, third.
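As a minimal sketch of how these three categorical types can be stored as factors in R, the example below uses made-up values. The variable names (`employed`, `major`, `place`) are hypothetical and chosen only for illustration.

```r
# Hypothetical example data; the values are made up for illustration.
employed <- factor(c("yes", "no", "yes", "yes"))                # binary: two categories
major <- factor(c("biology", "history", "physics", "history")) # nominal: no order
place <- factor(
  c("second", "first", "third", "first"),
  levels = c("first", "second", "third"),
  ordered = TRUE                                                # ordinal: ordered levels
)

nlevels(employed)   # number of categories
levels(major)       # categories, sorted alphabetically by default
place < "third"     # order comparisons only make sense for ordered factors
table(place)        # counts reported in the logical order
```

Setting `ordered = TRUE` together with an explicit `levels` argument is what tells R about the logical order; without it, R treats the categories as nominal.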
Includes the following:
Definition - An interval variable has equal intervals along the scale: equal differences in the variable represent equal differences in the property being measured. This variable does not have a true zero.
Definition - A ratio variable is the same as an interval variable, but the ratios of scores on the scale must also make sense. This variable does have a true zero.
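The difference between interval and ratio scales can be illustrated with temperature, assuming Celsius as an interval scale and Kelvin as a ratio scale. This is only a sketch with two made-up readings.

```r
celsius <- c(10, 20)          # interval scale: 0 degrees C is not "no temperature"
kelvin <- celsius + 273.15    # ratio scale: 0 K is a true zero

diff(celsius)                 # difference between the two readings
diff(kelvin)                  # the same difference on the Kelvin scale
celsius[2] / celsius[1]       # equals 2, but 20 C is not "twice as hot" as 10 C
kelvin[2] / kelvin[1]         # about 1.035, the physically meaningful ratio
```

Differences agree on both scales, while ratios are only interpretable on the scale with a true zero.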
Measurement Error (also known as observational error):
Definition - The discrepancy between the actual value we're trying to measure, and the number we use to represent that value.
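A small simulation can make this definition concrete. The sketch below assumes a hypothetical true weight of 80 kg and a scale whose readings carry random noise; both numbers are invented for illustration.

```r
set.seed(42)                                  # make the simulated noise reproducible
true_weight <- 80                             # hypothetical true value, in kg
measured <- true_weight + rnorm(5, mean = 0, sd = 0.5)  # five noisy scale readings

measurement_error <- measured - true_weight   # discrepancy for each reading
round(measurement_error, 2)
mean(abs(measurement_error))                  # average size of the error
```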
Validity:
Including the following:
Reliability:
Test-Retest Reliability:
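Test-retest reliability is commonly assessed by correlating scores from two administrations of the same measure. The following is a minimal sketch on simulated data; the sample size, score scale, and noise level are assumptions made for illustration.

```r
set.seed(1)
true_score <- rnorm(50, mean = 100, sd = 10)  # hypothetical stable trait for 50 people
time1 <- true_score + rnorm(50, sd = 3)       # scores at the first administration
time2 <- true_score + rnorm(50, sd = 3)       # scores at the second administration

retest_r <- cor(time1, time2)                 # Pearson correlation between occasions
retest_r
```

A correlation close to 1 suggests the measure gives consistent results over time; larger measurement noise would pull it toward 0.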
To use measures in any research and test them we must now understand the following: How to Measure?
It is different for certain types of research, including:
knitr::include_graphics("pictures/introDA1/cross_sectional_research_study.png")
Definition - One or more variables are systematically manipulated to see their effect (alone or in combination) on an outcome variable.
Cause and Effect (Hume, 1748)
Confounding variables: the 'Tertium Quid'
Ruling out confounds (Mill, 1865)
For instance:
Between-group/between-subject/independent
Repeated-measures (within-subject)
Systematic Variation
Unsystematic Variation
Randomization
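Random assignment can be sketched with `sample()`. The example below assumes 20 hypothetical participants split evenly between two conditions; the IDs and condition labels are invented for illustration.

```r
set.seed(123)                                  # reproducible assignment
participants <- paste0("P", 1:20)              # hypothetical participant IDs
# Shuffle ten "control" and ten "treatment" labels into a random order
condition <- sample(rep(c("control", "treatment"), each = 10))

assignment <- data.frame(participants, condition)
table(assignment$condition)                    # check that group sizes are balanced
head(assignment)
```

Shuffling a fixed set of labels, rather than drawing each label independently, keeps the group sizes equal.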
First, populations and samples should be understood so that your analysis is not misleading when interpreting results.
Population
Sample
A simple statistical model can be used to analyze data.
For instance, the mean is a hypothetical value.
tapply(iris$Sepal.Length, iris$Species, mean)
Numbers estimated from the entire dataset are a representation of the population.
The numbers estimated from a single test/study/experiment are considered a sample.
Parameters = Greek Symbols
iris_sample <- iris[sample(nrow(iris), 15), ] # random sample of 15 rows
tapply(iris_sample$Sepal.Length, iris_sample$Species, mean) # sample
tapply(iris$Sepal.Length, iris$Species, mean) # population
To analyze the data and generate interpretable results the following statistical models can be used:
In this lecture, you have learned: