Jim Tyhurst 2019-04-28
🔻 Work in Progress … 🔺
Source code at: EDA.Rmd
library(stackR)
library(readr)
library(forcats)
library(dplyr)
library(ggplot2)
library(lubridate)
library(knitr) # for kable()
Data were obtained from Stack Exchange, which licenses its data under the Creative Commons Attribution-ShareAlike license.
See Stack Overflow’s blog post, “Attribution Required”.
The Stack Exchange Data Explorer is an online tool for querying Stack Exchange data. I used it to count posts with certain characteristics and to download posts satisfying certain conditions.
I explored the other sources described below (the Stack Exchange data dump and the Google BigQuery dataset) briefly, but did not use them for this analysis.
Rather than downloading those large data dump files to my local machine, I decided to run SQL queries against the original data using the Stack Exchange Data Explorer and then download subsets of the data for local processing.
For example, as of 2019-04-07 the Stack Overflow portion of the data dump consists of eight compressed (7z) files:
| Name                             | Last modified     | Size   |
| -------------------------------- | ----------------- | ------ |
| stackoverflow.com-Badges.7z      | 04-Mar-2019 14:04 | 218.8M |
| stackoverflow.com-Comments.7z    | 04-Mar-2019 14:25 | 3.9G   |
| stackoverflow.com-PostHistory.7z | 04-Mar-2019 15:26 | 23.0G  |
| stackoverflow.com-PostLinks.7z   | 04-Mar-2019 15:27 | 76.8M  |
| stackoverflow.com-Posts.7z       | 04-Mar-2019 16:32 | 13.2G  |
| stackoverflow.com-Tags.7z        | 04-Mar-2019 16:32 | 768.9K |
| stackoverflow.com-Users.7z       | 04-Mar-2019 16:33 | 431.4M |
| stackoverflow.com-Votes.7z       | 04-Mar-2019 16:38 | 997.3M |
The Google BigQuery extract of the Stack Exchange data only covers 2008-07-31 to 2016-06-12, whereas the data available through the Stack Exchange Data Explorer covers 2008-07-31 to 2019-04-07. So I decided not to use Google BigQuery, since its data does not cover recent years.
Note that the Google BigQuery table structure is slightly different from the schema used by Stack Exchange. For example, the field names in Google BigQuery follow a snake_case naming convention (post_history_type_id), whereas the Stack Exchange database uses camel case (PostHistoryTypeId).
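If data from both sources ever need to be combined, the naming difference is easy to bridge. A minimal sketch (not used in this analysis) of a helper that converts BigQuery-style snake_case names to the CamelCase names of the Stack Exchange schema:

```r
# Hypothetical helper, not part of this analysis: convert BigQuery-style
# snake_case column names to the CamelCase used by the Stack Exchange schema.
snake_to_camel <- function(x) {
  # Uppercase the first letter and any letter that follows an underscore,
  # then drop the underscores: "post_history_type_id" -> "PostHistoryTypeId"
  gsub("(^|_)([a-z])", "\\U\\2", x, perl = TRUE)
}

snake_to_camel("post_history_type_id")
#> [1] "PostHistoryTypeId"
```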
**Sample Query**
-- Google BigQuery
-- Counts the number of Posts closed as a duplicate.
SELECT COUNT(DISTINCT d.id) AS n_closed_duplicate_questions
FROM [bigquery-public-data:stackoverflow.posts_questions] d -- d=duplicate
LEFT JOIN [bigquery-public-data:stackoverflow.post_history] ph ON ph.post_id = d.id
LEFT JOIN [bigquery-public-data:stackoverflow.post_links] pl ON pl.post_id = d.id
LEFT JOIN [bigquery-public-data:stackoverflow.posts_questions] o ON o.id = pl.related_post_id -- o=original
WHERE
d.post_type_id = 1 -- 1=Question
AND pl.link_type_id = 3 -- 3=duplicate
AND ph.post_history_type_id = 10 -- 10=Post Closed
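If one wanted to run such a query from R rather than from the BigQuery console, the bigrquery package could be used. This is a sketch only, not something I did for this analysis; it assumes a Google Cloud project with billing enabled (the project id below is a placeholder) and requests legacy SQL because of the bracketed table references above:

```r
# Sketch only (not used in this analysis): run the query above from R with
# bigrquery. "my-gcp-project" is a placeholder for a real GCP project id.
library(bigrquery)

query <- "
SELECT COUNT(DISTINCT d.id) AS n_closed_duplicate_questions
FROM [bigquery-public-data:stackoverflow.posts_questions] d
LEFT JOIN [bigquery-public-data:stackoverflow.post_history] ph ON ph.post_id = d.id
LEFT JOIN [bigquery-public-data:stackoverflow.post_links] pl ON pl.post_id = d.id
LEFT JOIN [bigquery-public-data:stackoverflow.posts_questions] o ON o.id = pl.related_post_id
WHERE d.post_type_id = 1
  AND pl.link_type_id = 3
  AND ph.post_history_type_id = 10
"

# The bracketed table references are legacy SQL, so ask for legacy SQL.
tbl_ref <- bq_project_query("my-gcp-project", query, use_legacy_sql = TRUE)
bq_table_download(tbl_ref)
```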
I used two queries to count total questions and duplicate questions for 20 of the 25 most popular programming languages:
-- Counts the total number of question posts with a given language tag.
SELECT COUNT(DISTINCT op.Id)
FROM Posts op
WHERE
  op.PostTypeId = 1              -- Question post
  AND op.Tags LIKE '%<java>%';   -- Language tag
-- Counts the number of question posts marked as a duplicate for a given language tag.
SELECT COUNT(DISTINCT dp.Id)
FROM Posts dp
  JOIN PostLinks pl ON dp.Id = pl.PostId
WHERE
  dp.PostTypeId = 1              -- Question post
  AND pl.LinkTypeId = 3          -- duplicate link
  AND dp.Tags LIKE '%<java>%';
The resulting table of data is available as a CSV file: duplicates-totals-pct.csv (download).
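For reference, a minimal sketch of how the per-language counts can be assembled into such a file in R (the java row uses the counts reported in the table below; the other languages follow the same pattern):

```r
# Sketch: collect the two counts returned by the queries above, one row per
# language, and write them out as duplicates-totals-pct.csv.
library(tibble)

counts <- tribble(
  ~language, ~total_posts, ~duplicate_posts,
  "java",    1538549,      64845
  # ... one row for each of the other languages ...
)

readr::write_csv(counts, "duplicates-totals-pct.csv")
```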
The following query was used to obtain data for duplicate posts. When you run the query in the Stack Exchange Data Explorer, it prompts for 3 parameter values:

- PageSize: the number of rows to return per "page", i.e. per subset of data.
- PageNumber: 1 for the first page, 2 for the second page, etc.
- Tags: a LIKE pattern matched against the posts.tags field. For example, '%<python>%' will match an instance of the tags field that contains the <python> tag. Note: tags all seem to be lower case and surrounded by < … >.

-- Duplicate pairs of questions related to a tag.
-- Derived from Jordan Witte's query:
-- https://data.stackexchange.com/stackoverflow/query/1029675/duplicate-pairs
-- https://github.com/Eric-D-Stevens/Data_Science_Meetup_April/blob/master/sql/basic_fetch_duplicate_pairs.sql
DECLARE @PageSize int = ##PageSize##
DECLARE @PageNumber int = ##PageNumber##
DECLARE @Tags nvarchar(max) = ##Tags## -- a length is needed; plain nvarchar would truncate the pattern to one character
SELECT
op.Id OriginalPostId ,
op.PostTypeId OriginalPostPostTypeId ,
op.AcceptedAnswerId OriginalPostAcceptedAnswerId ,
op.ParentId OriginalPostParentId ,
op.CreationDate OriginalPostCreationDate ,
op.DeletionDate OriginalPostDeletionDate ,
op.Score OriginalPostScore ,
op.ViewCount OriginalPostViewCount ,
op.Body OriginalPostBody ,
op.OwnerUserId OriginalPostOwnerUserId ,
op.OwnerDisplayName OriginalPostOwnerDisplayName ,
op.LastEditorUserId OriginalPostLastEditorUserId ,
op.LastEditorDisplayName OriginalPostLastEditorDisplayName ,
op.LastEditDate OriginalPostLastEditDate ,
op.LastActivityDate OriginalPostLastActivityDate ,
op.Title OriginalPostTitle ,
op.Tags OriginalPostTags ,
op.AnswerCount OriginalPostAnswerCount ,
op.CommentCount OriginalPostCommentCount ,
op.FavoriteCount OriginalPostFavoriteCount ,
op.ClosedDate OriginalPostClosedDate ,
op.CommunityOwnedDate OriginalPostCommunityOwnedDate ,
dupl.Id DuplicatePostId ,
dupl.PostTypeId DuplicatePostPostTypeId ,
dupl.AcceptedAnswerId DuplicatePostAcceptedAnswerId ,
dupl.ParentId DuplicatePostParentId ,
dupl.CreationDate DuplicatePostCreationDate ,
dupl.DeletionDate DuplicatePostDeletionDate ,
dupl.Score DuplicatePostScore ,
dupl.ViewCount DuplicatePostViewCount ,
dupl.Body DuplicatePostBody ,
dupl.OwnerUserId DuplicatePostOwnerUserId ,
dupl.OwnerDisplayName DuplicatePostOwnerDisplayName ,
dupl.LastEditorUserId DuplicatePostLastEditorUserId ,
dupl.LastEditorDisplayName DuplicatePostLastEditorDisplayName ,
dupl.LastEditDate DuplicatePostLastEditDate ,
dupl.LastActivityDate DuplicatePostLastActivityDate ,
dupl.Title DuplicatePostTitle ,
dupl.Tags DuplicatePostTags ,
dupl.AnswerCount DuplicatePostAnswerCount ,
dupl.CommentCount DuplicatePostCommentCount ,
dupl.FavoriteCount DuplicatePostFavoriteCount ,
dupl.ClosedDate DuplicatePostClosedDate ,
dupl.CommunityOwnedDate DuplicatePostCommunityOwnedDate
FROM posts op
JOIN postlinks pl ON op.id = pl.relatedpostid
JOIN posts dupl ON pl.postid = dupl.id
WHERE
op.posttypeid = 1 -- Question post
AND pl.linktypeid = 3 -- duplicate
AND dupl.tags like @Tags -- e.g. '%<python>%' for "contains python"
ORDER BY op.Id
OFFSET @PageSize * (@PageNumber - 1) ROWS
FETCH NEXT @PageSize ROWS ONLY;
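Each run of the query returns one page of rows, which can be saved as a CSV from the Data Explorer. The saved pages can then be combined locally; a minimal sketch, with hypothetical file names (duplicates_r_page1.csv, duplicates_r_page2.csv, …) in a hypothetical data-raw/ directory:

```r
# Sketch: combine several downloaded pages into a single data frame.
# The directory and file names are hypothetical; adjust them to match
# whatever was saved from the Data Explorer.
page_files <- list.files(
  path = "data-raw",
  pattern = "^duplicates_r_page\\d+\\.csv$",
  full.names = TRUE
)

dups_all <- page_files %>%
  lapply(readr::read_csv) %>%
  dplyr::bind_rows()
```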
I used this query to download duplicates for 6 of the 25 most popular programming languages, ranging from the most popular (JavaScript) to languages near the bottom of the list (R and Clojure):
raw_pcts <- system.file(
"extdata",
"duplicates-totals-pct.csv",
package = "stackR"
) %>%
readr::read_csv() %>%
dplyr::mutate(pct_duplicates = (duplicate_posts / total_posts) * 100.0) %>%
dplyr::arrange(desc(pct_duplicates))
#> Parsed with column specification:
#> cols(
#> language = col_character(),
#> total_posts = col_double(),
#> duplicate_posts = col_double()
#> )
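As an aside, the parsing message above can be avoided by passing read_csv() an explicit column specification matching the one it printed:

```r
# Optional: an explicit column specification, matching the printed one,
# keeps readr::read_csv() from emitting the parsing message.
pct_spec <- readr::cols(
  language = readr::col_character(),
  total_posts = readr::col_double(),
  duplicate_posts = readr::col_double()
)
# Then: readr::read_csv(..., col_types = pct_spec)
```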
kable(raw_pcts, digits = 2)
| language   | total_posts | duplicate_posts | pct_duplicates |
| :--------- | ----------: | --------------: | -------------: |
| rust       |       11813 |            1333 |          11.28 |
| bash       |      110927 |            6395 |           5.77 |
| r          |      285818 |           14234 |           4.98 |
| c          |      301484 |           14758 |           4.90 |
| shell      |       69923 |            3126 |           4.47 |
| c++        |      614289 |           27237 |           4.43 |
| java       |     1538549 |           64845 |           4.21 |
| python     |     1153996 |           46415 |           4.02 |
| php        |     1277793 |           48112 |           3.77 |
| javascript |     1796439 |           62295 |           3.47 |
| swift      |      227614 |            7381 |           3.24 |
| html       |      817665 |           25873 |           3.16 |
| css        |      582623 |           17404 |           2.99 |
| c#         |     1304340 |           38646 |           2.96 |
| kotlin     |       22023 |             510 |           2.32 |
| sql        |      488290 |           11227 |           2.30 |
| typescript |       79466 |            1519 |           1.91 |
| powershell |       67326 |            1138 |           1.69 |
| clojure    |       15457 |             194 |           1.26 |
| elixir     |        6814 |              69 |           1.01 |
# Adds 'sorted_language' column, sorted by the percentage,
# for plotting.
sorted_pcts <- raw_pcts %>%
dplyr::mutate(
sorted_language = forcats::fct_reorder(
language,
pct_duplicates,
.desc = TRUE
)
)
sorted_pcts %>%
ggplot(aes(x = sorted_language, y = pct_duplicates)) +
geom_col() +
scale_y_continuous(
limits = c(0, 15),
minor_breaks = seq(0, 15, by = 1),
breaks = seq(0, 15, by = 5)
) +
theme(
panel.grid.minor = element_line(color = "grey60", size = 0.5),
panel.grid.major = element_line(color = "grey40", size = 0.5),
panel.background = element_rect(fill = "snow2")
) +
labs(
title = "Percentage of duplicates by language",
x = "Language",
y = "% of Duplicates"
)
Read the downloaded duplicates for the R language:
dups <- system.file(
"extdata",
"duplicates_r.csv",
package = "stackR"
) %>%
readr::read_csv()
print(dim(dups))
#> [1] 15453 44
Plot the distribution of the number of duplicates per original question:
dup_distr <- dups %>%
# n = number of duplicate questions pointing at each original question
dplyr::group_by(OriginalPostId) %>%
dplyr::summarize(n = n()) %>%
dplyr::select(OriginalPostId, n) %>%
# qty = number of original questions that have exactly n duplicates
dplyr::group_by(n) %>%
dplyr::summarize(qty = n()) %>%
dplyr::arrange(n)
kable(dup_distr)
|   n |  qty |
| --: | ---: |
|   1 | 5006 |
|   2 | 1004 |
|   3 |  398 |
|   4 |  179 |
|   5 |  106 |
|   6 |   49 |
|   7 |   52 |
|   8 |   31 |
|   9 |   22 |
|  10 |   25 |
|  11 |   11 |
|  12 |   14 |
|  13 |    6 |
|  14 |    7 |
|  15 |   11 |
|  16 |    5 |
|  17 |    6 |
|  18 |    5 |
|  19 |    5 |
|  20 |    4 |
|  21 |    2 |
|  22 |    8 |
|  23 |    2 |
|  24 |    2 |
|  25 |    3 |
|  26 |    2 |
|  27 |    1 |
|  29 |    3 |
|  30 |    2 |
|  31 |    5 |
|  32 |    1 |
|  36 |    2 |
|  38 |    1 |
|  39 |    3 |
|  40 |    3 |
|  44 |    1 |
|  47 |    1 |
|  48 |    2 |
|  50 |    1 |
|  51 |    2 |
|  52 |    2 |
|  55 |    1 |
|  57 |    1 |
|  59 |    1 |
|  62 |    2 |
|  64 |    1 |
|  71 |    1 |
|  73 |    1 |
|  88 |    1 |
|  99 |    1 |
| 100 |    1 |
| 102 |    1 |
| 136 |    1 |
| 140 |    1 |
| 188 |    1 |
| 287 |    1 |
| 335 |    1 |
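Two quick consistency checks on this distribution: the qty column should sum to the number of distinct original questions, and n * qty should sum to the number of duplicate rows read above.

```r
# Consistency checks:
# - sum(qty)     = number of distinct original questions
# - sum(n * qty) = number of duplicate rows in `dups`
dup_distr %>%
  dplyr::summarize(
    n_originals = sum(qty),
    n_duplicate_rows = sum(n * qty)
  )
```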
dup_distr %>%
ggplot(aes(x = n, y = qty)) +
geom_point()
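Because both n and qty span several orders of magnitude, a log-log version of the same plot may be easier to read; a minimal variant:

```r
# The same distribution on log-log axes, since both variables are skewed.
dup_distr %>%
  ggplot(aes(x = n, y = qty)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    title = "Number of duplicates per original question",
    x = "Duplicates per original question (n)",
    y = "Number of original questions (qty)"
  )
```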
🔻 Work in Progress … I am still analyzing this data. 🔺