knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The proc_means()
function simulates a SAS® PROC MEANS procedure. It is used to
generate summary statistics on numeric variables. The function is both
interactive and returns datasets.
Let's again create some sample data. This sample data is identical to that
created for the proc_freq()
tutorial, but has an additional grouping
variable "g":
# Create sample data dat <- read.table(header = TRUE, text = 'x y z g 6 A 60 P 6 A 70 P 2 A 100 P 2 B 10 P 3 B 67 Q 2 C 81 Q 3 C 63 Q 5 C 55 Q') # View sample data dat x y z g 1 6 A 60 P 2 6 A 70 P 3 2 A 100 P 4 2 B 10 P 5 3 B 67 Q 6 2 C 81 Q 7 3 C 63 Q 8 5 C 55 Q
If no parameters are specified, the proc_means()
function will calculate
N, Means, Standard Deviation, Minimum, and Maximum on all numeric variables.
Note that the options
statement has been added to pass CRAN checks.
When you are running code samples, this statement may be omitted.
# Turn off printing for CRAN checks options("procs.print" = FALSE) # No parameters proc_means(dat)
If you don't want statistics on all numeric variables, you may specify
variables on the var
parameter:
# Specific variable proc_means(dat, var = x)
The proc_means()
function has a stats
parameter that allows you to control
which statistics are generated. There are many statistics keywords. Here is
a sample of some of the most frequently used keywords:
Keyword | Description |
---|---|
N | Number of Observations |
NMISS | Number of missing observations |
MEAN | Arithmetic mean |
STD | Standard Deviation |
MIN | Minimum |
MAX | Maximum |
SUM | Sum of observations |
MEDIAN | 50th percentile |
P1 | 1st percentile |
P5 | 5st percentile |
P10 | 10th percentile |
P90 | 90th percentile |
P95 | 95th percentile |
P99 | 99th percentile |
Q1 | First Quartile |
Q3 | Third Quartile |
Now that we know some statistics keywords, let's practice using them. Here is an example which calculates the median, sum, first quartile, and third quartile for all numeric variables in our sample data:
# Custom statistics options proc_means(dat, stats = v(median, sum, q1, q3))
Similar to the proc_freq()
function, proc_means()
can return datasets.
There are three options: "out", "report", and "none". The
"out" option returns datasets meant for further manipulation and analysis, and
is the default.
The "report" keyword requests the exact datasets used in the interactive
report. Specifying either one of these options will cause the function
to return data.
Here is an example that shows the difference in the "report" and "out" options:
# Output dataset using "report" option res1 <- proc_means(dat, stats = v(median, sum, q1, q3), output = report) # View results res1 # VAR MEDIAN SUM Q1 Q3 # 1 x 3 29 2.0 5.5 # 2 z 65 506 57.5 75.5 # Output dataset using "out" option res2 <- proc_means(dat, stats = v(median, sum, q1, q3), output = out) # View results res2 # TYPE FREQ VAR MEDIAN SUM Q1 Q3 # 1 0 8 x 3 29 2.0 5.5 # 2 0 8 z 65 506 57.5 75.5
As can be seen in the above example, the "out" dataset includes additional variables for TYPE and FREQ. These additional variables can be turned off with options:
# Turn off TYPE and FREQ variables res3 <- proc_means(dat, stats = v(median, sum, q1, q3), output = all, options = v(notype, nofreq)) # View results res3 # VAR MEDIAN SUM Q1 Q3 # 1 x 3 29 2.0 5.5 # 2 z 65 506 57.5 75.5
The proc_means()
function provides two grouping parameters: class
and by
.
These parameters identify a variable or variables for subsetting the input
data. While these parameters have similar capabilities, there are some
difference between them. The differences can be examined by comparing
the two function calls.
# Class grouping res1 <- proc_means(dat, stats = v(median, sum, q1, q3), class = y, options = v(maxdec = 4))
Below is the output dataset from the class
parameter. Notice that
summary values have been provided for each variable, in addition to the subsets
by the class variable. The summary rows are identifed by TYPE = 0, while the
subset rows are TYPE = 1.
# View results - class res1 # CLASS TYPE FREQ VAR MEDIAN SUM Q1 Q3 # 1 <NA> 0 8 x 3.0 29 2.0 5.5 # 2 <NA> 0 8 z 65.0 506 57.5 75.5 # 3 A 1 3 x 6.0 14 2.0 6.0 # 4 A 1 3 z 70.0 230 60.0 100.0 # 5 B 1 2 x 2.5 5 2.0 3.0 # 6 B 1 2 z 38.5 77 10.0 67.0 # 7 C 1 3 x 3.0 10 2.0 5.0 # 8 C 1 3 z 63.0 199 55.0 81.0
Here is the same analysis using the by
parameter instead of the class
parameter:
# By grouping res2 <- proc_means(dat, stats = v(median, sum, q1, q3), by = y, options = v(maxdec = 4))
Notice that with the by
parameter, separate tables are created for each
by group on the interactive report.
Now let's look at the output dataset:
# View results - by res2 # BY TYPE FREQ VAR MEDIAN SUM Q1 Q3 # 1 A 0 3 x 6.0 14 2 6 # 2 A 0 3 z 70.0 230 60 100 # 3 B 0 2 x 2.5 5 2 3 # 4 B 0 2 z 38.5 77 10 67 # 5 C 0 3 x 3.0 10 2 5 # 6 C 0 3 z 63.0 199 55 81
The output dataset is also different from the class
output.
While the TYPE variables exists on the output dataset,
the output data for the by group does not include the summary rows. The
summary rows are a feature of the class
variable. You should select
the grouping parameter that most suits your needs.
The proc_means()
function can perform analysis with multiple grouping variables.
First let's examine what happens when we pass multiple grouping variables to
the class
parameter:
# Class grouping - two variables res1 <- proc_means(dat, stats = v(median, sum, q1, q3), class = v(g, y), options = v(maxdec = 0))
Here is the output dataset:
# View results - two class variables res1 # CLASS1 CLASS2 TYPE FREQ VAR MEDIAN SUM Q1 Q3 # 1 <NA> <NA> 0 8 x 3.0 29 2.0 5.5 # 2 <NA> <NA> 0 8 z 65.0 506 57.5 75.5 # 3 <NA> A 1 3 x 6.0 14 2.0 6.0 # 4 <NA> A 1 3 z 70.0 230 60.0 100.0 # 5 <NA> B 1 2 x 2.5 5 2.0 3.0 # 6 <NA> B 1 2 z 38.5 77 10.0 67.0 # 7 <NA> C 1 3 x 3.0 10 2.0 5.0 # 8 <NA> C 1 3 z 63.0 199 55.0 81.0 # 9 P <NA> 2 4 x 4.0 16 2.0 6.0 # 10 P <NA> 2 4 z 65.0 240 35.0 85.0 # 11 Q <NA> 2 4 x 3.0 13 2.5 4.0 # 12 Q <NA> 2 4 z 65.0 266 59.0 74.0 # 13 P A 3 3 x 6.0 14 2.0 6.0 # 14 P A 3 3 z 70.0 230 60.0 100.0 # 15 P B 3 1 x 2.0 2 2.0 2.0 # 16 P B 3 1 z 10.0 10 10.0 10.0 # 17 Q B 3 1 x 3.0 3 3.0 3.0 # 18 Q B 3 1 z 67.0 67 67.0 67.0 # 19 Q C 3 3 x 3.0 10 2.0 5.0 # 20 Q C 3 3 z 63.0 199 55.0 81.0
Observe that the function produces statistics for each set of combinations
of the class
variable. Each level of combinations is identified by the
TYPE value. To turn off the class combinations, pass the "nway" option.
Now let's see what happens when we use the by
parameter with
two variables:
# By grouping - two variables res2 <- proc_means(dat, stats = v(median, sum, q1, q3), by = v(g, y), options = v(maxdec = 0))
Here is the output dataset for the by
parameter:
# View results - two by variables res2 # BY1 BY2 TYPE FREQ VAR MEDIAN SUM Q1 Q3 # 1 P A 0 3 x 6 14 2 6 # 2 P A 0 3 z 70 230 60 100 # 3 P B 0 1 x 2 2 2 2 # 4 P B 0 1 z 10 10 10 10 # 5 Q B 0 1 x 3 3 3 3 # 6 Q B 0 1 z 67 67 67 67 # 7 Q C 0 3 x 3 10 2 5 # 8 Q C 0 3 z 63 199 55 81
Finally, let's see what happens when we specify both by
and class
parameters:
# By grouping - by and class res3 <- proc_means(dat, stats = v(median, sum, q1, q3), by = g, class = y, options = v(maxdec = 0))
# View results - by and class res3 # BY CLASS TYPE FREQ VAR MEDIAN SUM Q1 Q3 # 1 P <NA> 0 4 x 4 16 2.0 6 # 2 P <NA> 0 4 z 65 240 35.0 85 # 3 P A 1 3 x 6 14 2.0 6 # 4 P A 1 3 z 70 230 60.0 100 # 5 P B 1 1 x 2 2 2.0 2 # 6 P B 1 1 z 10 10 10.0 10 # 7 Q <NA> 0 4 x 3 13 2.5 4 # 8 Q <NA> 0 4 z 65 266 59.0 74 # 9 Q B 1 1 x 3 3 3.0 3 # 10 Q B 1 1 z 67 67 67.0 67 # 11 Q C 1 3 x 3 10 2.0 5 # 12 Q C 1 3 z 63 199 55.0 81
The proc_means()
function also offers options for data shaping. The
shaping options can reduce the number of transformations needed to
get to your target table.
There are three shaping options: "wide", "long", and "stacked". The "wide" option is the default, and places the statistics in columns and variables in rows. The "long" option places statistics in rows and variables in columns. The "stacked" option puts both statistics and variables in rows.
The following example illustrates the differences between these data shaping options:
# Shape wide res1 <- proc_means(dat, stats = v(median, sum, q1, q3), output = wide) # Wide results res1 # TYPE FREQ VAR MEDIAN SUM Q1 Q3 # 1 0 8 x 3 29 2.0 5.5 # 2 0 8 z 65 506 57.5 75.5 # Shape long res2 <- proc_means(dat, stats = v(median, sum, q1, q3), output = long) # Long results res2 # TYPE FREQ STAT x z # 1 0 8 MEDIAN 3.0 65.0 # 2 0 8 SUM 29.0 506.0 # 3 0 8 Q1 2.0 57.5 # 4 0 8 Q3 5.5 75.5 # Shape stacked res3 <- proc_means(dat, stats = v(median, sum, q1, q3), output = stacked) # Stacked results res3 # TYPE FREQ VAR STAT VALUES # 1 0 8 x MEDIAN 3.0 # 2 0 8 x SUM 29.0 # 3 0 8 x Q1 2.0 # 4 0 8 x Q3 5.5 # 5 0 8 z MEDIAN 65.0 # 6 0 8 z SUM 506.0 # 7 0 8 z Q1 57.5 # 8 0 8 z Q3 75.5
Next: T-Test Function
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.