Home

/

GitHub

/

jimtyhurst/stackR

/

vignettes/EDA.md

stackR: Functions to Explore, Analyze, and Visualize Open Data from stackexchange.com

Exploratory Data Analysis

Jim Tyhurst 2019-04-28

Package dependencies
Data source
Data preparation
- Duplicate question counts
- Duplicate question content
Exploratory analysis of duplicates
- Percentage of duplicates by language
- Distribution of number of duplicates

🔻 Work in Process … 🔺

Source code at: EDA.Rmd

library(stackR)
library(readr)
library(forcats)
library(dplyr)
library(ggplot2)
library(lubridate)

Data was obtained from StackExchange. They license their data with the Creative Commons Attribution-ShareAlike license.

See Stack Overflow’s blog post, “Attribution Required”.

Stack Exchange Data Explorer is an online tool for accessing Stack Exchange data. I used this tool to count posts having certain characteristics and to download posts satisfying certain conditions.

I explored these data sources briefly, but did not use them for this analysis.

Zipped data files

Rather than trying to install these large files on my local machine, I decided to run SQL queries against the original data using the Stack Exchange Data Explorer and then download subsets of data for processing locally.

For example, as of 2019-04-07 there are 7 CSV files for posts in Stack Overflow:

| Name | Last modified | Size | | -------------------------------- | ----------------- | ------ | | stackoverflow.com-Badges.7z | 04-Mar-2019 14:04 | 218.8M | | stackoverflow.com-Comments.7z | 04-Mar-2019 14:25 | 3.9G | | stackoverflow.com-PostHistory.7z | 04-Mar-2019 15:26 | 23.0G | | stackoverflow.com-PostLinks.7z | 04-Mar-2019 15:27 | 76.8M | | stackoverflow.com-Posts.7z | 04-Mar-2019 16:32 | 13.2G | | stackoverflow.com-Tags.7z | 04-Mar-2019 16:32 | 768.9K | | stackoverflow.com-Users.7z | 04-Mar-2019 16:33 | 431.4M | | stackoverflow.com-Votes.7z | 04-Mar-2019 16:38 | 997.3M |

Google BigQuery

This extract of the Stack Exchange data only covers 2008-07-31 to 2016-06-12, whereas the data available through the Stack Exchange Data Explorer covers 2008-07-31 to 2019-04-07. So I decided not to use Google BigQuery, since the data does not cover recent years.

Blog post: “Google BigQuery public datasets now include Stack Overflow Q\&A”
Compose Query: bigquery-public-data:stackoverflow

Note that the Google BigQuery table structure is slightly different than the schema used by Stack Exchange. For example, the field names in Google BigQuery follow a snake-case naming convention (post_history_type_id), whereas the Stack Exchange database uses camel-case (PostHistoryTypeId).

** Sample Query **

-- Google BigQuery
-- Counts the number of Posts closed as a duplicate.
SELECT COUNT(DISTINCT d.id) AS n_closed_duplicate_questions
FROM [bigquery-public-data:stackoverflow.posts_questions] d  -- d=duplicate
LEFT JOIN [bigquery-public-data:stackoverflow.post_history] ph ON ph.post_id = d.id
LEFT JOIN [bigquery-public-data:stackoverflow.post_links] pl ON pl.post_id = d.id
LEFT JOIN [bigquery-public-data:stackoverflow.posts_questions] o ON o.id = pl.related_post_id  -- o=original
WHERE
  d.post_type_id = 1  -- 1=Question
  AND pl.link_type_id = 3  -- 3=duplicate
  AND ph.post_history_type_id = 10  -- 10=Post Closed

I used two queries to count total questions and duplicate questions for 20 of the 25 most popular programming languages:

-- Counts total number of question posts with a given language tag.
SELECT
  COUNT(DISTINCT(op.Id))
FROM posts op 
WHERE
  op.posttypeid = 1            -- Question post
  AND op.Tags like '%<java>%'  -- Language tag

-- Counts number of question posts marked as a duplicate for a given language tag.
SELECT COUNT(DISTINCT dp.id)
FROM posts dp
  JOIN postlinks pl ON dp.id = pl.postid
WHERE
  dp.posttypeid = 1 -- Question post
  AND pl.linktypeid = 3 -- duplicate
  AND dp.tags like '%<java>%';

The resulting table of data is available as a CSV file, duplicates-totals-pct.csv (download).

The following query was used to obtain data for duplicate posts. When you run the query in the Stack Exchange Data Explorer, it prompts for 3 parameter values:

PageSize : Number of rows to return. 50,000 seems to be the maximum that will be returned from any one query.
Page Number : Enables you to download a large dataset by retrieving one page at a time. For example, use 1 for the first “page”, i.e. subset of data, 2 for the second page, etc.
Tags : A string to match the posts.tags field. For example, '%<python>%' will match an instance of the tags field that contains the <python> tag. Note: tags all seem to be lower case and surrounded by < … >.

-- Duplicate pairs of questions related to a tag.
-- Derived from Jordan Witte's query:
--   https://data.stackexchange.com/stackoverflow/query/1029675/duplicate-pairs
--   https://github.com/Eric-D-Stevens/Data_Science_Meetup_April/blob/master/sql/basic_fetch_duplicate_pairs.sql

DECLARE @PageSize int = ##PageSize##
DECLARE @PageNumber int = ##PageNumber##
DECLARE @Tags nvarchar = ##Tags##

SELECT 
  op.Id OriginalPostId , 
  op.PostTypeId OriginalPostPostTypeId , 
  op.AcceptedAnswerId OriginalPostAcceptedAnswerId , 
  op.ParentId OriginalPostParentId , 
  op.CreationDate OriginalPostCreationDate , 
  op.DeletionDate OriginalPostDeletionDate , 
  op.Score OriginalPostScore , 
  op.ViewCount OriginalPostViewCount , 
  op.Body OriginalPostBody , 
  op.OwnerUserId OriginalPostOwnerUserId , 
  op.OwnerDisplayName OriginalPostOwnerDisplayName , 
  op.LastEditorUserId OriginalPostLastEditorUserId , 
  op.LastEditorDisplayName OriginalPostLastEditorDisplayName , 
  op.LastEditDate OriginalPostLastEditDate , 
  op.LastActivityDate OriginalPostLastActivityDate , 
  op.Title OriginalPostTitle , 
  op.Tags OriginalPostTags , 
  op.AnswerCount OriginalPostAnswerCount , 
  op.CommentCount OriginalPostCommentCount , 
  op.FavoriteCount OriginalPostFavoriteCount , 
  op.ClosedDate OriginalPostClosedDate , 
  op.CommunityOwnedDate OriginalPostCommunityOwnedDate , 

  dupl.Id DuplicatePostId , 
  dupl.PostTypeId DuplicatePostPostTypeId , 
  dupl.AcceptedAnswerId DuplicatePostAcceptedAnswerId , 
  dupl.ParentId DuplicatePostParentId , 
  dupl.CreationDate DuplicatePostCreationDate , 
  dupl.DeletionDate DuplicatePostDeletionDate , 
  dupl.Score DuplicatePostScore , 
  dupl.ViewCount DuplicatePostViewCount , 
  dupl.Body DuplicatePostBody , 
  dupl.OwnerUserId DuplicatePostOwnerUserId , 
  dupl.OwnerDisplayName DuplicatePostOwnerDisplayName , 
  dupl.LastEditorUserId DuplicatePostLastEditorUserId , 
  dupl.LastEditorDisplayName DuplicatePostLastEditorDisplayName , 
  dupl.LastEditDate DuplicatePostLastEditDate , 
  dupl.LastActivityDate DuplicatePostLastActivityDate , 
  dupl.Title DuplicatePostTitle , 
  dupl.Tags DuplicatePostTags , 
  dupl.AnswerCount DuplicatePostAnswerCount , 
  dupl.CommentCount DuplicatePostCommentCount , 
  dupl.FavoriteCount DuplicatePostFavoriteCount , 
  dupl.ClosedDate DuplicatePostClosedDate , 
  dupl.CommunityOwnedDate DuplicatePostCommunityOwnedDate
FROM posts op 
  JOIN postlinks pl ON op.id = pl.relatedpostid
  JOIN posts dupl ON pl.postid = dupl.id
WHERE
  op.posttypeid = 1      -- Question post
  AND pl.linktypeid = 3  -- duplicate
  AND dupl.tags like @Tags  -- e.g. '%<python>%' for "contains python"
ORDER BY op.Id 
  OFFSET @PageSize * (@PageNumber - 1) ROWS
  FETCH NEXT @PageSize ROWS ONLY;

I used this query to download duplicates for 6 of the 25 most popular programming languages, choosing languages from the most popular (JavaScript) to language near the bottom of the list (R and Clojure):

C#
Clojure
Java
JavaScript
Python
R

raw_pcts <- system.file(
    "extdata", 
    "duplicates-totals-pct.csv", 
    package = "stackR"
  ) %>% 
  readr::read_csv() %>% 
  dplyr::mutate(pct_duplicates = (duplicate_posts / total_posts) * 100.0) %>% 
  dplyr::arrange(desc(pct_duplicates))
#> Parsed with column specification:
#> cols(
#>   language = col_character(),
#>   total_posts = col_double(),
#>   duplicate_posts = col_double()
#> )
kable(raw_pcts, digits = 2)

| language | total_posts | duplicate_posts | pct_duplicates | | :--------- | -----------: | ---------------: | --------------: | | rust | 11813 | 1333 | 11.28 | | bash | 110927 | 6395 | 5.77 | | r | 285818 | 14234 | 4.98 | | c | 301484 | 14758 | 4.90 | | shell | 69923 | 3126 | 4.47 | | c++ | 614289 | 27237 | 4.43 | | java | 1538549 | 64845 | 4.21 | | python | 1153996 | 46415 | 4.02 | | php | 1277793 | 48112 | 3.77 | | javascript | 1796439 | 62295 | 3.47 | | swift | 227614 | 7381 | 3.24 | | html | 817665 | 25873 | 3.16 | | css | 582623 | 17404 | 2.99 | | c# | 1304340 | 38646 | 2.96 | | kotlin | 22023 | 510 | 2.32 | | sql | 488290 | 11227 | 2.30 | | typescript | 79466 | 1519 | 1.91 | | powershell | 67326 | 1138 | 1.69 | | clojure | 15457 | 194 | 1.26 | | elixir | 6814 | 69 | 1.01 |

# Adds 'sorted_language' column, sorted by the percentage,
# for plotting.
sorted_pcts <- raw_pcts %>% 
  dplyr::mutate(
    sorted_language = forcats::fct_reorder(
      language, 
      pct_duplicates, 
      .desc = TRUE
    )
  )

sorted_pcts %>%
  ggplot(aes(x = sorted_language, y = pct_duplicates)) +
    geom_col() +
  scale_y_continuous(
    limits = c(0, 15), 
    minor_breaks = seq(0, 15, by = 1),
    breaks = seq(0, 15, by = 5),
  ) + 
  theme(
      panel.grid.minor = element_line(color="grey60", size=0.5),
      panel.grid.major = element_line(color="grey40", size=0.5),
      panel.background = element_rect(fill="snow2")
    ) +
    labs(
      title = "Percentage of duplicates by language", 
      x = "Language", 
      y = "% of Duplicates"
    )

Read the downloaded duplicates for the R language:

dups <- system.file(
    "extdata", 
    "duplicates_r.csv", 
    package = "stackR"
  ) %>% 
  readr::read_csv() 
print(dim(dups))
#> [1] 15453    44

Plot the distribution of the number of duplicates per original question:

dup_distr <- dups %>% 
  dplyr::group_by(OriginalPostId) %>% 
  dplyr::summarize(n = n()) %>% 
  dplyr::select(OriginalPostId, n) %>% 
  dplyr::group_by(n) %>% 
  dplyr::summarize(qty = n()) %>% 
  dplyr::arrange(n)
kable(dup_distr)

| n | qty | | --: | ---: | | 1 | 5006 | | 2 | 1004 | | 3 | 398 | | 4 | 179 | | 5 | 106 | | 6 | 49 | | 7 | 52 | | 8 | 31 | | 9 | 22 | | 10 | 25 | | 11 | 11 | | 12 | 14 | | 13 | 6 | | 14 | 7 | | 15 | 11 | | 16 | 5 | | 17 | 6 | | 18 | 5 | | 19 | 5 | | 20 | 4 | | 21 | 2 | | 22 | 8 | | 23 | 2 | | 24 | 2 | | 25 | 3 | | 26 | 2 | | 27 | 1 | | 29 | 3 | | 30 | 2 | | 31 | 5 | | 32 | 1 | | 36 | 2 | | 38 | 1 | | 39 | 3 | | 40 | 3 | | 44 | 1 | | 47 | 1 | | 48 | 2 | | 50 | 1 | | 51 | 2 | | 52 | 2 | | 55 | 1 | | 57 | 1 | | 59 | 1 | | 62 | 2 | | 64 | 1 | | 71 | 1 | | 73 | 1 | | 88 | 1 | | 99 | 1 | | 100 | 1 | | 102 | 1 | | 136 | 1 | | 140 | 1 | | 188 | 1 | | 287 | 1 | | 335 | 1 |

dup_distr %>% 
  ggplot(aes(x = n, y = qty)) +
    geom_point()

🔻 Work in Process … I am still analyzing this data. 🔺

jimtyhurst/stackR documentation built on May 17, 2019, 6:12 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

jimtyhurst/stackR
Functions to Explore, Analyze, and Visualize Open Data from stackexchange.com

vignettes/EDA.md
In jimtyhurst/stackR: Functions to Explore, Analyze, and Visualize Open Data from stackexchange.com

Exploratory Data Analysis

Package dependencies

Data source

Attribution of data

Schema

SQL access to Stack Exchange data

Other sources of Stack Exchange data are available

Zipped data files

Google BigQuery

Data preparation

Duplicate question counts

Duplicate question content

Exploratory analysis of duplicates

Percentage of duplicates by language

Distribution of number of duplicates

R Package Documentation

Browse R Packages

We want your feedback!

jimtyhurst/stackR Functions to Explore, Analyze, and Visualize Open Data from stackexchange.com

vignettes/EDA.md In jimtyhurst/stackR: Functions to Explore, Analyze, and Visualize Open Data from stackexchange.com

Exploratory Data Analysis

Package dependencies

Data source

Attribution of data

Schema

SQL access to Stack Exchange data

Other sources of Stack Exchange data are available

Zipped data files

Google BigQuery

Data preparation

Duplicate question counts

Duplicate question content

Exploratory analysis of duplicates

Percentage of duplicates by language

Distribution of number of duplicates

R Package Documentation

Browse R Packages

We want your feedback!

jimtyhurst/stackR
Functions to Explore, Analyze, and Visualize Open Data from stackexchange.com

vignettes/EDA.md
In jimtyhurst/stackR: Functions to Explore, Analyze, and Visualize Open Data from stackexchange.com