column_window_functions: Window functions for Column operations

Description Usage Arguments Details Note Examples

Description

Window functions defined for Column.

Usage


Arguments

x

the Column or WindowSpec to use for the rank functions. This defaults to ordering by monotonically_increasing_id(). If a WindowSpec is used, partitioning columns can be added as well; any existing grouping variables will still be respected.

...

additional argument(s).

n

an integer of the number of tiles used in ntile

Details

rank: Returns the rank of rows within a window partition. The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking sequence when there are ties. That is, if you were ranking a competition using dense_rank and had three people tie for second place, you would say that all three were in second place and that the next person came in third. rank, in contrast, gives sequential numbers, so the person that came in after the ties would register as coming in fifth. This is equivalent to the RANK function in SQL. This can be used with either a Column or a WindowSpec.

min_rank: An alias for rank. In Spark, rank functions the same way min_rank does in dplyr. This can be used with either a Column or a WindowSpec.

dense_rank: Returns the rank of rows within a window partition, without any gaps. The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking sequence when there are ties. That is, if you were ranking a competition using dense_rank and had three people tie for second place, you would say that all three were in second place and that the next person came in third. rank, in contrast, gives sequential numbers, so the person that came in after the ties would register as coming in fifth. This is equivalent to the DENSE_RANK function in SQL. This can be used with either a Column or a WindowSpec.
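To make the tie behavior concrete, here is a minimal sketch using the same tidyspark API shown in the Examples below, with a hypothetical score column in which three values tie for second place:

```r
## Not run:
# hypothetical data: three rows tie for second place
scores <- spark_tbl(data.frame(score = c(10, 20, 20, 20, 30)))
scores %>%
  mutate(rank = rank(score),
         dense_rank = dense_rank(score)) %>%
  collect
# for the three tied 20s, both functions return 2; for the 30,
# rank returns 5 (skipping 3 and 4) while dense_rank returns 3
## End(Not run)
```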

percent_rank: Returns the relative rank (i.e. percentile) of rows within a window partition. This is computed by: (rank of row in its partition - 1) / (number of rows in the partition - 1). This is equivalent to the PERCENT_RANK function in SQL. This can be used with either a Column or a WindowSpec.
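As a worked instance of that formula, using the same hypothetical 5-row column of scores as above: the three rows tied at rank 2 each get (2 - 1) / (5 - 1) = 0.25.

```r
## Not run:
spark_tbl(data.frame(score = c(10, 20, 20, 20, 30))) %>%
  mutate(pct = percent_rank(score)) %>%
  collect
# the ranks are 1, 2, 2, 2, 5, so percent_rank yields
# (1-1)/4 = 0, (2-1)/4 = 0.25 (three times), and (5-1)/4 = 1
## End(Not run)
```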

cume_dist: Returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are below the current row: (number of values before and including x) / (total number of rows in the partition). This is equivalent to the CUME_DIST function in SQL. This can be used with either a Column or a WindowSpec.
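The same hypothetical 5-row column illustrates the fraction: every value of 20 has four of the five rows at or below it, so its cumulative distribution is 4/5.

```r
## Not run:
spark_tbl(data.frame(score = c(10, 20, 20, 20, 30))) %>%
  mutate(cd = cume_dist(score)) %>%
  collect
# 10 -> 1/5 = 0.2; each 20 -> 4/5 = 0.8; 30 -> 5/5 = 1.0
## End(Not run)
```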

row_number: Returns a sequential number starting at 1 within a window partition. This is equivalent to the ROW_NUMBER function in SQL. This can be used with either a Column, WindowSpec, or without an argument, which will order by monotonically_increasing_id().

ntile: Returns the ntile group id (from 1 to n inclusive) in an ordered window partition. For example, if n is 4, the first quarter of the rows will get value 1, the second quarter will get 2, the third quarter will get 3, and the last quarter will get 4. This is equivalent to the NTILE function in SQL.
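A sketch of the quartile case with a hypothetical 8-row column, chosen so the rows divide evenly into n = 4 tiles:

```r
## Not run:
spark_tbl(data.frame(score = 1:8)) %>%
  mutate(quartile = ntile(score, 4)) %>%
  collect
# scores 1-2 fall in tile 1, 3-4 in tile 2, 5-6 in tile 3, 7-8 in tile 4
## End(Not run)
```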

Note

rank since 1.6.0

min_rank since 1.6.0

dense_rank since 1.6.0

percent_rank since 1.6.0

cume_dist since 1.6.0

row_number since 1.6.0

ntile since 1.6.0

Examples

## Not run: 
# with column names
spark_tbl(iris) %>%
  mutate(lead = lead(Petal_Width, 1),
         lag = lag(Petal_Width, 2),
         n = row_number(),
         rank = rank(Petal_Width),
         dense_rank = dense_rank(Petal_Width),
         percent_rank = percent_rank(Petal_Width),
         cume_dist = cume_dist(Petal_Width),
         tiles_4 = ntile(Petal_Width, 4)) %>%
  collect

# with grouping things are more efficient
spark_tbl(iris) %>%
  group_by(Species) %>%
  mutate(lead = lead(Petal_Width, 1),
         lag = lag(Petal_Width, 2),
         n = row_number(),
         rank = rank(Petal_Width),
         dense_rank = dense_rank(Petal_Width),
         percent_rank = percent_rank(Petal_Width),
         cume_dist = cume_dist(Petal_Width),
         tiles_4 = ntile(Petal_Width, 4)) %>%
  collect

# you can also use a WindowSpec() to achieve the same thing
wndw <- partitionBy(windowOrderBy("Petal_Width"), "Species")
spark_tbl(iris) %>%
  mutate(n = row_number(),
         rank = rank(wndw),
         dense_rank = dense_rank(wndw),
         percent_rank = percent_rank(wndw),
         cume_dist = cume_dist(wndw),
         tiles_4 = ntile(wndw, 4)) %>%
  collect

## End(Not run)

danzafar/tidyspark documentation built on Sept. 30, 2020, 12:19 p.m.