tests/testthat/feeds.feedburner.com/SimplyStatistics-41c0e0.R

structure(list(url = "http://feeds.feedburner.com/SimplyStatistics?format=xml", 
    status_code = 200L, headers = structure(list(`content-type` = "text/xml; charset=UTF-8", 
        etag = "NbwqtZPDw/HK1/E6l1mhaLpha7Q", `last-modified` = "Mon, 24 Feb 2020 15:06:40 GMT", 
        `transfer-encoding` = "chunked", date = "Mon, 24 Feb 2020 16:00:18 GMT", 
        expires = "Mon, 24 Feb 2020 16:00:18 GMT", `cache-control` = "private, max-age=0", 
        `x-content-type-options` = "nosniff", `x-xss-protection` = "1; mode=block", 
        server = "GSE", `proxy-connection` = "Keep-Alive", connection = "Keep-Alive", 
        `content-encoding` = "gzip"), class = c("insensitive", 
    "list")), all_headers = list(list(status = 200L, version = "HTTP/1.1", 
        headers = structure(list(`content-type` = "text/xml; charset=UTF-8", 
            etag = "NbwqtZPDw/HK1/E6l1mhaLpha7Q", `last-modified` = "Mon, 24 Feb 2020 15:06:40 GMT", 
            `transfer-encoding` = "chunked", date = "Mon, 24 Feb 2020 16:00:18 GMT", 
            expires = "Mon, 24 Feb 2020 16:00:18 GMT", `cache-control` = "private, max-age=0", 
            `x-content-type-options` = "nosniff", `x-xss-protection` = "1; mode=block", 
            server = "GSE", `proxy-connection` = "Keep-Alive", 
            connection = "Keep-Alive", `content-encoding` = "gzip"), class = c("insensitive", 
        "list")))), cookies = structure(list(domain = logical(0), 
        flag = logical(0), path = logical(0), secure = logical(0), 
        expiration = structure(numeric(0), class = c("POSIXct", 
        "POSIXt")), name = logical(0), value = logical(0)), row.names = integer(0), class = "data.frame"), 
    content = charToRaw("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n<?xml-stylesheet type=\"text/xsl\" media=\"screen\" href=\"/~d/styles/rss2full.xsl\"?><?xml-stylesheet type=\"text/css\" media=\"screen\" href=\"http://feeds.feedburner.com/~d/styles/itemcontent.css\"?><rss xmlns:atom=\"http://www.w3.org/2005/Atom\" xmlns:feedburner=\"http://rssnamespace.org/feedburner/ext/1.0\" version=\"2.0\">\r\n  <channel>\r\n    <title>Simply Statistics</title>\r\n    <link>https://simplystatistics.org/index.xml</link>\r\n    <description>Recent content on Simply Statistics</description>\r\n    <generator>Hugo -- gohugo.io</generator>\r\n    <language>en-us</language>\r\n    <copyright>&amp;copy; 2011 - 2017. All rights reserved.</copyright>\r\n    <lastBuildDate>Wed, 04 Dec 2019 00:00:00 +0000</lastBuildDate>\r\n    \r\n    \r\n    <atom10:link xmlns:atom10=\"http://www.w3.org/2005/Atom\" rel=\"self\" type=\"application/rss+xml\" href=\"http://feeds.feedburner.com/SimplyStatistics\" /><feedburner:info uri=\"simplystatistics\" /><atom10:link xmlns:atom10=\"http://www.w3.org/2005/Atom\" rel=\"hub\" href=\"http://pubsubhubbub.appspot.com/\" /><item>\r\n      <title>Is Artificial Intelligence Revolutionizing Environmental Health?</title>\r\n      <link>http://feedproxy.google.com/~r/SimplyStatistics/~3/xAmQ6vrlL-Q/</link>\r\n      <pubDate>Wed, 04 Dec 2019 00:00:00 +0000</pubDate>\r\n      \r\n      <guid isPermaLink=\"false\">https://simplystatistics.org/2019/12/04/is-artificial-intelligence-revolutionizing-environmental-health/</guid>\r\n      <description>&lt;p&gt;&lt;em&gt;NOTE: This post was written by Kevin Elliott, Michigan State University; Nicole Kleinstreuer, National Institutes of Health; Patrick McMullen, ScitoVation; Gary Miller, Columbia University; Bhramar Mukherjee, University of Michigan; Roger D. Peng, Johns Hopkins University; Melissa Perry, The George Washington University; Reza Rasoulpour, Corteva Agriscience, and Elizabeth Boyle, National Academies of Sciences, Engineering, and Medicine. The full summary for the workshop on which this post is based can be obtained &lt;a href=\"https://www.nap.edu/catalog/25520/leveraging-artificial-intelligence-and-machine-learning-to-advance-environmental-health-research-and-decisions\"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;On June 6 and 7, 2019, the National Academy of Sciences, Engineering, and Medicine (NASEM), hosted a workshop on the use of artificial intelligence (AI) in the field of &lt;a href=\"http://nas-sites.org/emergingscience/meetings/ai/\"&gt;Environmental Health&lt;/a&gt;. Rapid advances in machine learning are demonstrating the ability of machines to carry out repetitive “smart” tasks requiring discreet judgments. Machine learning algorithms are now being used to analyze large volumes of complex data to find patterns and make predictions, often exceeding the accuracy and efficiency of people attempting the same task. Driven by tremendous growth in data availability as well as computing power and accessibility, artificial intelligence and machine learning applications are rapidly growing in various sectors of society including retail, such as predicting consumer purchases; the automotive industry as demonstrated by self-driving cars, and in health care with advances in automated medical diagnoses.&lt;/p&gt;\r\n\r\n&lt;p&gt;Building upon the major themes of the NASEM workshop, in this blog post we address the following questions:&lt;/p&gt;\r\n\r\n&lt;ul&gt;\r\n&lt;li&gt;&lt;p&gt;How might AI advance environmental health?&lt;/p&gt;&lt;/li&gt;\r\n\r\n&lt;li&gt;&lt;p&gt;Does AI change the standards used for conducting environmental health research?&lt;/p&gt;&lt;/li&gt;\r\n\r\n&lt;li&gt;&lt;p&gt;Does the use of AI allow us to change our established research principles?&lt;/p&gt;&lt;/li&gt;\r\n\r\n&lt;li&gt;&lt;p&gt;How does AI impact our training programs for the next generation of environmental health scientists?&lt;/p&gt;&lt;/li&gt;\r\n\r\n&lt;li&gt;&lt;p&gt;Are there barriers within the current academic incentive structures that are hindering the full potential of AI, and how might those barriers be overcome?&lt;/p&gt;&lt;/li&gt;\r\n&lt;/ul&gt;\r\n\r\n&lt;h2 id=\"how-might-ai-advance-environmental-health\"&gt;How might AI advance environmental health?&lt;/h2&gt;\r\n\r\n&lt;p&gt;Environmental health is the study of how the environment affects human health. Due to the complexity of both human biology and the multiplicity of environmental factors that we encounter daily, studying environmental impacts on human health presents many data challenges. Due to the data boom we have seen in recent years we now have a multitude of individualized data including genetic sequencing and wearable health and activity monitors.  We have also seen exponential growth in the availability of data on individual environmental exposures.  Wearable sensors and personal chemical samplers are allowing for more detailed exposure models, whereas advancements in exposure biomonitoring in a variety of matrices including blood and urine is giving more granular detail about actual chemical body burdens. We have also seen an increase in available population level data on dietary factors, the social and built environment, climate, and many other variables affected by environmental and genetic factors. Concurrently, while population data are booming, toxicology is creating a variety of experimental models to advance our understanding of how chemicals and environmental exposures may pose risks to human health. Large-scale high-throughput chemical safety screening efforts can now generate data on tens of thousands of chemicals in thousands of biological targets. Integrating these diverse data streams represents a new level of complexity.&lt;/p&gt;\r\n\r\n&lt;p&gt;AI and machine learning provide many opportunities to make this complexity more manageable, such as highly accurate prediction methods to better assess exposures and flexible approaches to allow incorporation of exposure to complex mixtures in population health analyses. Incorporating artificial intelligence and machine learning methods in environmental health research offers the potential to transform how we analyze environmental exposures and our understanding of how these myriad factors influence our health and contribute to disease.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"does-ai-change-the-standards-used-for-conducting-environmental-health-research\"&gt;Does AI change the standards used for conducting environmental health research?&lt;/h2&gt;\r\n\r\n&lt;p&gt;While we think the use of AI and machine learning techniques clearly hold great promise for the advancement of environmental health research, we also believe such techniques introduce new challenges and magnify existing ones.  While the major standards by which we conduct scientific research do not change, our ability to adhere to them will require some adaptation. Transparency and repeatability are key.  We must ensure that the computational reproducibility and replicability of our scientific findings do not suffer at the hands of complex algorithms and poorly assembled data pipelines. Complex data analyses that incorporate more diverse data types from varied sources stretch our ability to track, curate, and validate these data without robust data curation tools.  Although some data curation tools that establish standard approaches for creating, managing, and maintaining data are available, they are usually field-specific, and currently there are no incentives or strict requirements to ensure that investigators use them.&lt;/p&gt;\r\n\r\n&lt;p&gt;Machine learning and artificial intelligence algorithms have demonstrated themselves to be very powerful. At the same time, we also recognize their complexity and general opacity can be cause for concern. While investigators may be willing to overlook the opacity of these algorithms when predictions are highly accurate and precise, all is well until it isn’t. When an algorithm does not work as expected, it is critical to know why it didn’t work. With transparency and reproducibility of utmost importance, machine learning algorithms must ensure that investigators and data analysts have accountability in their analyses and that regulators have confidence in applying AI generated results to inform public health decisions.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"does-the-use-of-ai-allow-us-to-change-our-established-research-principles\"&gt;Does the use of AI allow us to change our established research principles?&lt;/h2&gt;\r\n\r\n&lt;p&gt;AI does not change established research principles such as sound study designs and understanding threats of bias. However, there is a need to create updated guidelines and implement best practices for choosing, cleaning, structuring, and sharing the data used in AI applications. Creating appropriate training datasets, engaging in ongoing processes of validation, and assessing the domain of applicability for the models that are generated are also important. As in all areas of science, it is crucial to clarify whether models solely provide accurate predictions or whether they also provide understanding of relevant mechanisms. The current Open Science movement’s emphasis on transparency is particularly relevant to the use of AI and machine learning. Users of these methods in environmental health should be looking for ways to be open about the model training data, to clarify validation methods, to create interpretable “models of the models” where possible, and to clarify their domains of applicability. Recent innovations like model cards, or short documents that go alongside machine learning models to share information that everyone impacted by the model should know, is one example of a way model developers can communicate their models’ strengths and weaknesses in a way that is accessible.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"how-does-ai-impact-our-training-programs-for-the-next-generation-of-environmental-health-scientists\"&gt;How does AI impact our training programs for the next generation of environmental health scientists?&lt;/h2&gt;\r\n\r\n&lt;p&gt;As complex AI methods are increasingly applied to environmental health research, it is important to consider effective training of the workforce and its future leaders. Currently, training in the application of data science is unstandardized, as trainees learn how to apply methods to a specific research application through an apprenticeship type model, where a trainee works with a mentor. Classroom training standardizes theory and methods, but the mentor teaches the fine details of analyzing data in a specific research area, which introduces heterogeneity into the ways in which scientists analyze data. The lack of training standards leads to a worry that analysts may apply cutting-edge computational/algorithmic approaches to data analysis, without consideration of fundamental biostatistical and epidemiologic principles, such as statistical design, sampling, and inference.\r\nFundamental questions taught in biostatistics and epidemiology courses, such as &amp;ldquo;Who is in my sample?&amp;rdquo; and &amp;ldquo;What is my target population of inference?&amp;rdquo; are even more relevant in our current era of algorithms and machine learning. Now analysts are agnostically querying databases not designed for population-based research such electronic health records, medical claims, Twitter, Facebook, and Google searches, for new discoveries in environmental health. It is important to recognize that a lack of proper consideration of issues related to sampling, selection bias, correlation of multiple exposures, exposure and outcome misclassification could lead to erroneous results and false conclusions.  Training programs will need to evolve so that we do not just teach scientists and analysts how to program models and interpret their results, but also emphasize how to recognize human biases that can be inadvertently built into the data and model approaches, and the continuous need for rigor, responsibility, and reproducibility.&lt;/p&gt;\r\n\r\n&lt;p&gt;An increased focus on mathematical theory may also improve training in the application of AI to environmental health. A greater effort in developing standardized theory about how and why a specific research area analyses data in a certain way may help adapt approaches from one research area to another. In addition, deeper mathematical exploration of AI methods will help data scientists understand when and why AI methods work well, and when they don’t.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"are-there-barriers-within-the-current-academic-incentive-structures-that-are-hindering-the-full-potential-of-ai-and-how-might-those-barriers-be-overcome\"&gt;Are there barriers within the current academic incentive structures that are hindering the full potential of AI, and how might those barriers be overcome?&lt;/h2&gt;\r\n\r\n&lt;p&gt;Rigorous data science requires a team science approach to achieve a variety of functions such as developing algorithms, formalizing common data platforms and testing protocols, and properly maintaining and curating data sources. Over recent decades, we have witnessed how the power of team science has improved the understanding of critical health problems of our time such as in unlocking the human genome and achieving major advancements in cancer treatment.  These advances have demonstrated the payoff of interdisciplinary, transdisciplinary, and multidisciplinary investigations. Despite these successes, there are still barriers to large team science projects, because these projects often have goals that do not sit precisely within a single funding agency.  In order for AI to truly advance environmental health, federal agencies and institutions that fund environmental health research need to create pathways to support large multi-disciplinary and multi-institutional teams that are conducting this research. An example could be a multi-agency/multi-institute funding consortia. A ten-year investment in a well-coordinated initiative that harnesses AI data opportunities could accelerate new findings in not only the environmental causes of disease, but also in informing interventions that can prevent environmentally mediated disease and improve population health.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"final-thoughts\"&gt;Final thoughts&lt;/h2&gt;\r\n\r\n&lt;p&gt;We believe machine learning and AI methods have tremendous potential but we also believe they cannot be used in a way that overlooks limitations or relaxes data integrity standards. With these considerations in mind, we have tempered enthusiasm for the promises of these approaches. We have to make sure that environmental health scientists stay out in front of these considerations to avoid potential pitfalls such as the allure of hype or chasing after the next new thing because it is novel rather than truly meaningful.  We can do this by fostering ongoing conversations about the challenges and opportunities AI provides for environmental health research. An intentional union of the two cultures of careful (and often overly cautious) stochastic and bold (and often overly optimistic) algorithmic modeling can help to ensuring we are not abandoning principles of proper study design when a new technology comes along, but explore how to use the new technology to better understand the myriad ways the environment affects health and disease.&lt;/p&gt;&lt;img src=\"http://feeds.feedburner.com/~r/SimplyStatistics/~4/xAmQ6vrlL-Q\" height=\"1\" width=\"1\" alt=\"\"/&gt;</description>\r\n    <feedburner:origLink>https://simplystatistics.org/2019/12/04/is-artificial-intelligence-revolutionizing-environmental-health/</feedburner:origLink></item>\r\n    \r\n    <item>\r\n      <title>You can replicate almost any plot with R</title>\r\n      <link>http://feedproxy.google.com/~r/SimplyStatistics/~3/M6bRYD0Hlmo/</link>\r\n      <pubDate>Wed, 28 Aug 2019 00:00:00 +0000</pubDate>\r\n      \r\n      <guid isPermaLink=\"false\">https://simplystatistics.org/2019/08/28/you-can-replicate-almost-any-plot-with-ggplot2/</guid>\r\n      <description>&lt;p&gt;Although R is great for quickly turning data into plots, it is not widely used for making publication ready figures. But, with enough tinkering you can make almost any plot in R. For examples check out the &lt;a href=\"https://flowingdata.com/\"&gt;flowingdata blog&lt;/a&gt; or the &lt;a href=\"https://serialmentor.com/dataviz/index.html\"&gt;Fundamentals of Data Visualization book&lt;/a&gt;.&lt;/p&gt;\r\n&lt;p&gt;Here I show five charts from the lay press that I use as examples in my data science courses. In the past I would show the originals, but I decided to replicate them in R to make it possible to generate class notes with just R code (there was a lot of googling involved).&lt;/p&gt;\r\n&lt;p&gt;Below I show the original figures followed by R code and the version of the plot it produces. I used the &lt;strong&gt;ggplot2&lt;/strong&gt; package but you can achieve similar results using other packages or even just with R-base. Any recommendations on how to improve the code or links to other good examples are welcomed. Please at to the comments or @ me on twitter: &lt;a href=\"https://twitter.com/rafalab\"&gt;@rafalab&lt;/a&gt;.&lt;/p&gt;\r\n&lt;div id=\"example-1\" class=\"section level2\"&gt;\r\n&lt;h2&gt;Example 1&lt;/h2&gt;\r\n&lt;p&gt;The first example is from &lt;a href=\"https://abcnews.go.com/blogs/headlines/2012/12/us-gun-ownership-homicide-rate-higher-than-other-developed-countries/\"&gt;this&lt;/a&gt; ABC news article. Here is the original:&lt;/p&gt;\r\n&lt;div class=\"figure\"&gt;\r\n&lt;img src=\"http://abcnews.go.com/images/International/homocides_g8_countries_640x360_wmain.jpg\" /&gt;\r\n\r\n&lt;/div&gt;\r\n&lt;p&gt;Here is the R code for my version. Note that I copied the values by hand.&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;library(tidyverse)\r\nlibrary(ggplot2)\r\nlibrary(ggflags)\r\nlibrary(countrycode)\r\n\r\ndat &amp;lt;- tibble(country = toupper(c(&amp;quot;US&amp;quot;, &amp;quot;Italy&amp;quot;, &amp;quot;Canada&amp;quot;, &amp;quot;UK&amp;quot;, &amp;quot;Japan&amp;quot;, &amp;quot;Germany&amp;quot;, &amp;quot;France&amp;quot;, &amp;quot;Russia&amp;quot;)),\r\n              count = c(3.2, 0.71, 0.5, 0.1, 0, 0.2, 0.1, 0),\r\n              label = c(as.character(c(3.2, 0.71, 0.5, 0.1, 0, 0.2, 0.1)), &amp;quot;No Data&amp;quot;),\r\n              code = c(&amp;quot;us&amp;quot;, &amp;quot;it&amp;quot;, &amp;quot;ca&amp;quot;, &amp;quot;gb&amp;quot;, &amp;quot;jp&amp;quot;, &amp;quot;de&amp;quot;, &amp;quot;fr&amp;quot;, &amp;quot;ru&amp;quot;))\r\n\r\ndat %&amp;gt;% mutate(country = reorder(country, -count)) %&amp;gt;%\r\n  ggplot(aes(country, count, label = label)) +\r\n  geom_bar(stat = &amp;quot;identity&amp;quot;, fill = &amp;quot;darkred&amp;quot;) +\r\n  geom_text(nudge_y = 0.2, color = &amp;quot;darkred&amp;quot;, size = 5) +\r\n  geom_flag(y = -.5, aes(country = code), size = 12) +\r\n  scale_y_continuous(breaks = c(0, 1, 2, 3, 4), limits = c(0,4)) +   \r\n  geom_text(aes(6.25, 3.8, label = &amp;quot;Source UNODC Homicide Statistics&amp;quot;)) + \r\n  ggtitle(toupper(&amp;quot;Homicide Per 100,000 in G-8 Countries&amp;quot;)) + \r\n  xlab(&amp;quot;&amp;quot;) + \r\n  ylab(&amp;quot;# of gun-related homicides\\nper 100,000 people&amp;quot;) +\r\n  ggthemes::theme_economist() +\r\n  theme(axis.text.x = element_text(size = 8, vjust = -16),\r\n        axis.ticks.x = element_blank(),\r\n        axis.line.x = element_blank(),\r\n        plot.margin = unit(c(1,1,1,1), &amp;quot;cm&amp;quot;)) &lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-08-28-you-can-replicate-almost-any-plot-with-ggplot2_files/figure-html/murder-rate-example-1-1.png\" width=\"672\" /&gt;&lt;/p&gt;\r\n&lt;/div&gt;\r\n&lt;div id=\"example-2\" class=\"section level2\"&gt;\r\n&lt;h2&gt;Example 2&lt;/h2&gt;\r\n&lt;p&gt;The second example from &lt;a href=\"https://everytownresearch.org\"&gt;everytown.org&lt;/a&gt;. Here is the original:&lt;/p&gt;\r\n&lt;div class=\"figure\"&gt;\r\n&lt;img src=\"https://rafalab.github.io/dsbook/R/img/GunTrends_murders_per_1000.png\" /&gt;\r\n\r\n&lt;/div&gt;\r\n&lt;p&gt;Here is the R code for my version. As in the previous example I copied the values by hand.&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;dat &amp;lt;- tibble(country = toupper(c(&amp;quot;United States&amp;quot;, &amp;quot;Canada&amp;quot;, &amp;quot;Portugal&amp;quot;, &amp;quot;Ireland&amp;quot;, &amp;quot;Italy&amp;quot;, &amp;quot;Belgium&amp;quot;, &amp;quot;Finland&amp;quot;, &amp;quot;France&amp;quot;, &amp;quot;Netherlands&amp;quot;, &amp;quot;Denmark&amp;quot;, &amp;quot;Sweden&amp;quot;, &amp;quot;Slovakia&amp;quot;, &amp;quot;Austria&amp;quot;, &amp;quot;New Zealand&amp;quot;, &amp;quot;Australia&amp;quot;, &amp;quot;Spain&amp;quot;, &amp;quot;Czech Republic&amp;quot;, &amp;quot;Hungry&amp;quot;, &amp;quot;Germany&amp;quot;, &amp;quot;United Kingdom&amp;quot;, &amp;quot;Norway&amp;quot;, &amp;quot;Japan&amp;quot;, &amp;quot;Republic of Korea&amp;quot;)),\r\n              count = c(3.61, 0.5, 0.48, 0.35, 0.35, 0.33, 0.26, 0.20, 0.20, 0.20, 0.19, 0.19, 0.18, 0.16,\r\n                        0.16, 0.15, 0.12, 0.10, 0.06, 0.04, 0.04, 0.01, 0.01))\r\n\r\ndat %&amp;gt;% \r\n  mutate(country = reorder(country, count)) %&amp;gt;%\r\n  ggplot(aes(country, count, label = count)) +   \r\n  geom_bar(stat = &amp;quot;identity&amp;quot;, fill = &amp;quot;darkred&amp;quot;, width = 0.5) +\r\n  geom_text(nudge_y = 0.2,  size = 3) +\r\n  xlab(&amp;quot;&amp;quot;) + ylab(&amp;quot;&amp;quot;) + \r\n  ggtitle(toupper(&amp;quot;Gun Murders per 100,000 residents&amp;quot;)) + \r\n  theme_minimal() +\r\n  theme(panel.grid.major =element_blank(), panel.grid.minor = element_blank(), \r\n        axis.text.x = element_blank(),\r\n        axis.ticks.length = unit(-0.4, &amp;quot;cm&amp;quot;)) + \r\n  coord_flip() &lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-08-28-you-can-replicate-almost-any-plot-with-ggplot2_files/figure-html/murder-rate-example-2-1.png\" width=\"672\" /&gt;&lt;/p&gt;\r\n&lt;/div&gt;\r\n&lt;div id=\"example-3\" class=\"section level2\"&gt;\r\n&lt;h2&gt;Example 3&lt;/h2&gt;\r\n&lt;p&gt;The next example is from the &lt;a href=\"http://graphics.wsj.com/infectious-diseases-and-vaccines/?mc_cid=711ddeb86e\"&gt;Wall Street Journal&lt;/a&gt;. The original is interactive but here is a screenshot:&lt;/p&gt;\r\n&lt;div class=\"figure\"&gt;\r\n&lt;img src=\"https://rafalab.github.io/dsbook/dataviz/img/wsj-vaccines.png\" /&gt;\r\n\r\n&lt;/div&gt;\r\n&lt;p&gt;Here is the R code for my version. Note I matched the colors by hand as the original does not seem to follow a standard palette.&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;library(dslabs)\r\ndata(us_contagious_diseases)\r\nthe_disease &amp;lt;- &amp;quot;Measles&amp;quot;\r\ndat &amp;lt;- us_contagious_diseases %&amp;gt;%\r\n  filter(!state%in%c(&amp;quot;Hawaii&amp;quot;,&amp;quot;Alaska&amp;quot;) &amp;amp; disease == the_disease) %&amp;gt;%\r\n  mutate(rate = count / population * 10000 * 52 / weeks_reporting) \r\n\r\njet.colors &amp;lt;- colorRampPalette(c(&amp;quot;#F0FFFF&amp;quot;, &amp;quot;cyan&amp;quot;, &amp;quot;#007FFF&amp;quot;, &amp;quot;yellow&amp;quot;, &amp;quot;#FFBF00&amp;quot;, &amp;quot;orange&amp;quot;, &amp;quot;red&amp;quot;, &amp;quot;#7F0000&amp;quot;), bias = 2.25)\r\n\r\ndat %&amp;gt;% mutate(state = reorder(state, desc(state))) %&amp;gt;%\r\n  ggplot(aes(year, state, fill = rate)) +\r\n  geom_tile(color = &amp;quot;white&amp;quot;, size = 0.35) +\r\n  scale_x_continuous(expand = c(0,0)) +\r\n  scale_fill_gradientn(colors = jet.colors(16), na.value = &amp;#39;white&amp;#39;) +\r\n  geom_vline(xintercept = 1963, col = &amp;quot;black&amp;quot;) +\r\n  theme_minimal() + \r\n  theme(panel.grid = element_blank()) +\r\n        coord_cartesian(clip = &amp;#39;off&amp;#39;) +\r\n        ggtitle(the_disease) +\r\n        ylab(&amp;quot;&amp;quot;) +\r\n        xlab(&amp;quot;&amp;quot;) +  \r\n        theme(legend.position = &amp;quot;bottom&amp;quot;, text = element_text(size = 8)) + \r\n        annotate(geom = &amp;quot;text&amp;quot;, x = 1963, y = 50.5, label = &amp;quot;Vaccine introduced&amp;quot;, size = 3, hjust = 0)&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-08-28-you-can-replicate-almost-any-plot-with-ggplot2_files/figure-html/wsj-vaccines-example-1.png\" width=\"100%\" /&gt;&lt;/p&gt;\r\n&lt;/div&gt;\r\n&lt;div id=\"example-4\" class=\"section level2\"&gt;\r\n&lt;h2&gt;Example 4&lt;/h2&gt;\r\n&lt;p&gt;The next example is from the &lt;a href=\"https://www.nytimes.com/2011/02/19/nyregion/19schools.html\"&gt;New York Times&lt;/a&gt;. Here is the original:&lt;/p&gt;\r\n&lt;div class=\"figure\"&gt;\r\n&lt;img src=\"http://graphics8.nytimes.com/images/2011/02/19/nyregion/19schoolsch/19schoolsch-popup.gif\" /&gt;\r\n\r\n&lt;/div&gt;\r\n&lt;p&gt;Here is the R code for my version:&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;data(&amp;quot;nyc_regents_scores&amp;quot;)\r\nnyc_regents_scores$total &amp;lt;- rowSums(nyc_regents_scores[,-1], na.rm=TRUE)\r\nnyc_regents_scores %&amp;gt;% \r\n  filter(!is.na(score)) %&amp;gt;%\r\n  ggplot(aes(score, total)) + \r\n  annotate(&amp;quot;rect&amp;quot;, xmin = 65, xmax = 99, ymin = 0, ymax = 35000, alpha = .5) +\r\n  geom_bar(stat = &amp;quot;identity&amp;quot;, color = &amp;quot;black&amp;quot;, fill = &amp;quot;#C4843C&amp;quot;) + \r\n  annotate(&amp;quot;text&amp;quot;, x = 66, y = 28000, label = &amp;quot;MINIMUM\\nREGENTS DIPLOMA\\nSCORE IS 65&amp;quot;, hjust = 0, size = 3) +\r\n  annotate(&amp;quot;text&amp;quot;, x = 0, y = 12000, label = &amp;quot;2010 Regents scores on\\nthe five most common tests&amp;quot;, hjust = 0, size = 3) +\r\n  scale_x_continuous(breaks = seq(5, 95, 5), limit = c(0,99)) + \r\n  scale_y_continuous(position = &amp;quot;right&amp;quot;) +\r\n  ggtitle(&amp;quot;Scraping By&amp;quot;) + \r\n  xlab(&amp;quot;&amp;quot;) + ylab(&amp;quot;Number of tests&amp;quot;) + \r\n  theme_minimal() + \r\n  theme(panel.grid.major.x = element_blank(), \r\n        panel.grid.minor.x = element_blank(),\r\n        axis.ticks.length = unit(-0.2, &amp;quot;cm&amp;quot;),\r\n        plot.title = element_text(face = &amp;quot;bold&amp;quot;))&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-08-28-you-can-replicate-almost-any-plot-with-ggplot2_files/figure-html/regents-exams-example-1.png\" width=\"768\" /&gt;&lt;/p&gt;\r\n&lt;/div&gt;\r\n&lt;div id=\"example-5\" class=\"section level2\"&gt;\r\n&lt;h2&gt;Example 5&lt;/h2&gt;\r\n&lt;p&gt;This last one is from &lt;a href=\"https://projects.fivethirtyeight.com/2016-election-forecast/\"&gt;fivethirtyeight&lt;/a&gt;.&lt;/p&gt;\r\n&lt;div class=\"figure\"&gt;\r\n&lt;img src=\"https://rafalab.github.io/dsbook/inference/img/popular-vote-538.png\" /&gt;\r\n\r\n&lt;/div&gt;\r\n&lt;p&gt;Below is the R code for my version. Note that in this example I am essentially just drawing as I don’t estimate the distributions myself. I simply estimated parameters “by eye” and used a bit of trial and error.&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;my_dgamma &amp;lt;- function(x, mean = 1, sd = 1){\r\n  shape = mean^2/sd^2\r\n  scale = sd^2 / mean\r\n  dgamma(x, shape = shape, scale = scale)\r\n}\r\n\r\nmy_qgamma &amp;lt;- function(mean = 1, sd = 1){\r\n  shape = mean^2/sd^2\r\n  scale = sd^2 / mean\r\n  qgamma(c(0.1,0.9), shape = shape, scale = scale)\r\n}\r\n\r\ntmp &amp;lt;- tibble(candidate = c(&amp;quot;Clinton&amp;quot;, &amp;quot;Trump&amp;quot;, &amp;quot;Johnson&amp;quot;), \r\n              avg = c(48.5, 44.9, 5.0), \r\n              avg_txt = c(&amp;quot;48.5%&amp;quot;, &amp;quot;44.9%&amp;quot;, &amp;quot;5.0%&amp;quot;), \r\n              sd = rep(2, 3), \r\n              m = my_dgamma(avg, avg, sd)) %&amp;gt;%\r\n  mutate(candidate = reorder(candidate, -avg))\r\n\r\nxx &amp;lt;- seq(0, 75, len = 300)\r\n\r\ntmp_2 &amp;lt;- map_df(1:3, function(i){\r\n  tibble(candidate = tmp$candidate[i],\r\n         avg = tmp$avg[i],\r\n         sd = tmp$sd[i],\r\n         x = xx,\r\n         y = my_dgamma(xx, tmp$avg[i], tmp$sd[i]))\r\n})\r\n\r\ntmp_3 &amp;lt;- map_df(1:3, function(i){\r\n  qq &amp;lt;- my_qgamma(tmp$avg[i], tmp$sd[i])\r\n  xx &amp;lt;- seq(qq[1], qq[2], len = 200)\r\n  tibble(candidate = tmp$candidate[i],\r\n         avg = tmp$avg[i],\r\n         sd = tmp$sd[i],\r\n         x = xx,\r\n         y = my_dgamma(xx, tmp$avg[i], tmp$sd[i]))\r\n})\r\n         \r\ntmp_2 %&amp;gt;% \r\n  ggplot(aes(x, ymax = y, ymin = 0)) +\r\n  geom_ribbon(fill = &amp;quot;grey&amp;quot;) + \r\n  facet_grid(candidate~., switch = &amp;quot;y&amp;quot;) +\r\n  scale_x_continuous(breaks = seq(0, 75, 25), position = &amp;quot;top&amp;quot;,\r\n                     label = paste0(seq(0, 75, 25), &amp;quot;%&amp;quot;)) +\r\n  geom_abline(intercept = 0, slope = 0) +\r\n  xlab(&amp;quot;&amp;quot;) + ylab(&amp;quot;&amp;quot;) + \r\n  theme_minimal() + \r\n  theme(panel.grid.major.y = element_blank(), \r\n        panel.grid.minor.y = element_blank(),\r\n        axis.title.y = element_blank(),\r\n        axis.text.y = element_blank(),\r\n        axis.ticks.y = element_blank(),\r\n        strip.text.y = element_text(angle = 180, size = 11, vjust = 0.2)) + \r\n  geom_ribbon(data = tmp_3, mapping = aes(x = x, ymax = y, ymin = 0, fill = candidate), inherit.aes = FALSE, show.legend = FALSE) +\r\n  scale_fill_manual(values = c(&amp;quot;#3cace4&amp;quot;, &amp;quot;#fc5c34&amp;quot;, &amp;quot;#fccc2c&amp;quot;)) +\r\n  geom_point(data = tmp, mapping = aes(x = avg, y = m), inherit.aes = FALSE) + \r\n  geom_text(data = tmp, mapping = aes(x = avg, y = m, label = avg_txt), inherit.aes = FALSE, hjust = 0, nudge_x = 1) &lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-08-28-you-can-replicate-almost-any-plot-with-ggplot2_files/figure-html/fivethirtyeight-densities-1.png\" width=\"80%\" /&gt;&lt;/p&gt;\r\n&lt;/div&gt;&lt;img src=\"http://feeds.feedburner.com/~r/SimplyStatistics/~4/M6bRYD0Hlmo\" height=\"1\" width=\"1\" alt=\"\"/&gt;</description>\r\n    <feedburner:origLink>https://simplystatistics.org/2019/08/28/you-can-replicate-almost-any-plot-with-ggplot2/</feedburner:origLink></item>\r\n    \r\n    <item>\r\n      <title>So You Want to Start a Podcast</title>\r\n      <link>http://feedproxy.google.com/~r/SimplyStatistics/~3/HJdqhW8Eu3U/</link>\r\n      <pubDate>Tue, 27 Aug 2019 00:00:00 +0000</pubDate>\r\n      \r\n      <guid isPermaLink=\"false\">https://simplystatistics.org/2019/08/27/so-you-want-to-start-a-podcast/</guid>\r\n      <description>&lt;p&gt;Podcasting has gotten quite a bit easier over the past 10 years, due in part to improvements to hardware and software. I wrote about both how I &lt;a href=\"https://simplystatistics.org/2017/09/18/editing-podcasts-logic-pro-x/\"&gt;edit&lt;/a&gt; and &lt;a href=\"https://simplystatistics.org/2017/09/20/recording-podcasts-with-a-remote-cohost/\"&gt;record&lt;/a&gt; both of my podcasts about 2 years ago and, while not much has changed since then, I thought it might be helpful if I organized the information in a better way for people just starting out with a new podcast.&lt;/p&gt;\r\n\r\n&lt;p&gt;One frustrating problem that I find with podcasting is that the easy methods are indeed easy, and the difficult methods are indeed difficult, but the methods that are just &lt;em&gt;above&lt;/em&gt; easy, which other markets might label as “prosumer” or something like that, are&amp;hellip;kind of hard. One of the reasons is that once you start buying better hardware, everything kind of snowballs because the hardware becomes more modular. So instead of just using your phone headphones to record, you might buy a microphone, that connects to a stand, that connects to a USB interface using an XLR cable, that connects to your computer. Similarly, on the software side, there’s really not much out there that’s free. As a result of both phenomena, costs start to go up pretty quickly as soon as you step up just a little bit.&lt;/p&gt;\r\n\r\n&lt;p&gt;I can’t do anything about costs, but I thought I could help a little bit on sorting out what’s out there and what’s genuinely valuable. There are two versions here: the free and easy plan if you’re just starting out and the next level up, which is basically what I use.&lt;/p&gt;\r\n\r\n&lt;p&gt;The three things I’ll cover here that you need for podcasting are:&lt;/p&gt;\r\n\r\n&lt;ol&gt;\r\n&lt;li&gt;&lt;strong&gt;Hardware&lt;/strong&gt; - this includes all recording equipment like microphones, stands, cables, etc.&lt;/li&gt;\r\n&lt;li&gt;&lt;strong&gt;Recording Software&lt;/strong&gt; - Unless you live in a recording booth you’ll need some software for your computer (which I assume you have!)&lt;/li&gt;\r\n&lt;li&gt;&lt;strong&gt;Editing Software&lt;/strong&gt; - the more complicated your podcast gets the more you’ll need to edit (beyond just trimming the beginning and end of the audio files)&lt;/li&gt;\r\n&lt;li&gt;&lt;strong&gt;Hosting&lt;/strong&gt; - Unless you plan on running your own server (which is an option but I don’t recommend it) you’ll need someone to host your audio files.&lt;/li&gt;\r\n&lt;/ol&gt;\r\n\r\n&lt;h2 id=\"free-and-easy\"&gt;Free and Easy&lt;/h2&gt;\r\n\r\n&lt;p&gt;There are in fact ways to podcast for free and many people stay at this level for a long time because the quality is acceptable and cost is zero. If you want to just get started quickly here’s what you can do:&lt;/p&gt;\r\n\r\n&lt;ol&gt;\r\n&lt;li&gt;&lt;strong&gt;Hardware&lt;/strong&gt; - just use the headphones/microphone that came with your mobile phone.&lt;/li&gt;\r\n&lt;li&gt;&lt;strong&gt;Recording Software&lt;/strong&gt; - If you are doing a podcast by yourself, you can just use whatever app your phone has to record things like voice memos. On your computer, there should be a built-in app that just lets you record sound through the headphones.&lt;/li&gt;\r\n&lt;li&gt;&lt;strong&gt;Editing Software&lt;/strong&gt; - For editing I recommend either not editing (simpler!) or using something like &lt;a href=\"https://www.audacityteam.org\"&gt;Audacity&lt;/a&gt; to just trim the beginning and the end.&lt;/li&gt;\r\n&lt;li&gt;&lt;strong&gt;Hosting&lt;/strong&gt; - SoundCloud offers free hosting for up to 3 hours of content. This is plenty for just starting out and seeing if you like it, but you will likely use it up.&lt;/li&gt;\r\n&lt;/ol&gt;\r\n\r\n&lt;p&gt;If you are working with a partner, it gets a little more complicated and there are some additional notes on the recording software. My go-to recommendation for recording with a partner is to use &lt;a href=\"https://zencastr.com/\"&gt;Zencastr&lt;/a&gt;. Zencastr has a free plan that lets you record high-quality audio for a max of 2 people. (If you need to record more than 2 people, you can’t use the free option.) The nice thing about Zencastr is that it uses &lt;a href=\"https://en.wikipedia.org/wiki/WebRTC\"&gt;WebRTC&lt;/a&gt; to record directly off your microphone, so you don’t need to worry too much about the quality of your internet connection. What you get is separate audio files, one for each speaker, that are synched together. Occasionally, there are some synching glitches, but usually it works out. The files are automatically uploaded to a Dropbox account, so you’ll need one of those. Because Zencastr automatically goes to MP3 format, the files are relatively small. Also, if you have a guest who is less familiar with audio hardware/software, you can just send them a link that they can click on and they’re recording.&lt;/p&gt;\r\n\r\n&lt;p&gt;Note that even if your partner is sitting right next to you, it’s often simpler to just go to separate spaces and record “remotely”. The primary benefit of doing this is that you can cleanly record separate/independent audio tracks. This can be useful in the editing process.&lt;/p&gt;\r\n\r\n&lt;p&gt;If you prefer an all-in-one solution, there are services like &lt;a href=\"https://tryca.st\"&gt;Cast&lt;/a&gt; and &lt;a href=\"https://anchor.fm\"&gt;Anchor&lt;/a&gt; that offer recording, hosting, and distribution. Cast only has a free 1-month trial and so you have to pay eventually. Anchor appears to be free (I’ve never used it), but it was recently purchased by Spotify so it’s not immediately clear to me if anything will change. My guess is they’ll likely stay free because they want as many people making podcasts as possible. Anchor didn’t exist when I started podcasting but if it had I might have used it first. But it always makes me a little nervous when I can’t figure out how a company makes money.&lt;/p&gt;\r\n\r\n&lt;p&gt;To summarize, here’s the “free and easy” workflow that I recommend:&lt;/p&gt;\r\n\r\n&lt;ol&gt;\r\n&lt;li&gt;Record your podcast using Zencastr (especially if you have a partner), which then puts audio files on Dropbox&lt;/li&gt;\r\n&lt;li&gt;Trim beginning/ending of audio file with Audacity&lt;/li&gt;\r\n&lt;li&gt;Upload audio to SoundCloud and add episode metadata&lt;/li&gt;\r\n&lt;/ol&gt;\r\n\r\n&lt;p&gt;And here are the pros and cons:&lt;/p&gt;\r\n\r\n&lt;p&gt;Pros&lt;/p&gt;\r\n\r\n&lt;ul&gt;\r\n&lt;li&gt;It’s free&lt;/li&gt;\r\n&lt;/ul&gt;\r\n\r\n&lt;p&gt;Cons&lt;/p&gt;\r\n\r\n&lt;ul&gt;\r\n&lt;li&gt;Audio quality is acceptable but not great. Earbud type microphones are not designed for high quality and you can usually tell when someone has used them to record. Given that podcasts are all about audio, it’s hard for me to trade off audio quality.&lt;/li&gt;\r\n&lt;li&gt;Hosting limitations mean you can only get a few episodes up. But that’s a problem for down the road, right?&lt;/li&gt;\r\n&lt;li&gt;Editing is generally a third-order issue, but there is one scenario where it can be critical&amp;mdash;when you have a bad internet connection. Bad internet connections can introduce delays and cross-talk. These problems can be mitigated when editing (I give an example &lt;a href=\"https://simplystatistics.org/2017/09/18/editing-podcasts-logic-pro-x/\"&gt;here&lt;/a&gt;) but only with better software.&lt;/li&gt;\r\n&lt;/ul&gt;\r\n\r\n&lt;h2 id=\"beyond-free\"&gt;Beyond Free&lt;/h2&gt;\r\n\r\n&lt;p&gt;Beyond the free workflow, there are a number of upgrades that you can make and you can easily start spending a lot of money. But the only real upgrade that I think you need to make is to buy a good microphone. Surprisingly, this does not need to cost much money. The best podcasting microphone for the money out there is the &lt;a href=\"https://www.amazon.com/Audio-Technica-ATR2100-USB-Cardioid-Dynamic-Microphone/dp/B004QJOZS4/ref=sr_1_3?crid=35TCYURP9DCY0&amp;amp;keywords=audio+technical%27s+atr2100&amp;amp;qid=1566911267&amp;amp;s=gateway&amp;amp;sprefix=audio+technical%27s+atr2100%2Caps%2C122&amp;amp;sr=8-3\"&gt;Audio Technica ATR2100 USB micrphone&lt;/a&gt;. This is the microphone that Elizabeth uses on the &lt;a href=\"https://effortreport.libsyn.com\"&gt;The Effort Report&lt;/a&gt; and Hilary uses on &lt;a href=\"http://nssdeviations.com\"&gt;Not So Standard Deviations&lt;/a&gt;. As of this writing it’s \\$65 on Amazon, but I’ve seen it for as low as \\$40. The benefits of this microphone are:&lt;/p&gt;\r\n\r\n&lt;ul&gt;\r\n&lt;li&gt;The audio quality is high&lt;/li&gt;\r\n&lt;li&gt;It isolates vocal audio really well and doesn’t pick up a lot of background audio (good for noisy rooms like my office).&lt;/li&gt;\r\n&lt;li&gt;It connects directly to a computer via USB so you don’t need to buy a separate USB interface.&lt;/li&gt;\r\n&lt;li&gt;It’s cheap&lt;/li&gt;\r\n&lt;/ul&gt;\r\n\r\n&lt;p&gt;The problem with getting “better” (i.e. more expensive) microphones as that they tend to be more sensitive, which means they pick up more high-frequency background noise. Professional microphones are designed for you to be working in a sound-proof recording studio environment in which you want to pick up as much sound as possible. But podcasting, in general, tends to take place wherever. So you want a microphone that will only pick up your voice right in front of it. Technically, you lose a little quality this way, but it’s equally annoying to have a lot of background noise.&lt;/p&gt;\r\n\r\n&lt;p&gt;Now that you’ve got a microphone, you need to stick it somewhere. While you can always just hold the microphone, I’d recommend an adjustable stand of some sort. Desk stands like &lt;a href=\"https://www.amazon.com/InnoGear-Microphone-Suspension-Adjustable-Snowball/dp/B01L3LL95O/ref=sr_1_2_sspa?keywords=microphone+desk+stand&amp;amp;qid=1566911946&amp;amp;s=musical-instruments&amp;amp;sr=1-2-spons&amp;amp;psc=1&amp;amp;spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUE1OUJZT05aSEdVWkMmZW5jcnlwdGVkSWQ9QTA2OTg0ODgySElRWktJSjk1WFVRJmVuY3J5cHRlZEFkSWQ9QTAzMjI0NTVaSENFTVJaOFhZSUsmd2lkZ2V0TmFtZT1zcF9hdGYmYWN0aW9uPWNsaWNrUmVkaXJlY3QmZG9Ob3RMb2dDbGljaz10cnVl\"&gt;this one&lt;/a&gt; are nice because they’re adjustable but they do require you to have a semi-permanent office where you can just keep it. The main point here is that podcasting requires you to sit still and talk for a while, and you don’t want to be uncomfortable while you’re doing it.&lt;/p&gt;\r\n\r\n&lt;p&gt;The last upgrade you’ll likely need to make is the hosting provider. SoundCloud itself offers an unlimited plan but I don’t recommend it as it’s not really designed for podcasting. I use &lt;a href=\"https://libsyn.com\"&gt;Libsyn&lt;/a&gt;, which has a $5 a month plan that should be enough for a monthly podcast. They also provide some decent analytics that you can download and read into R. What I like about Libsyn is that they do one job and they do it really well. I give them money, and they provide me a service in return. How simple is that?&lt;/p&gt;\r\n\r\n&lt;p&gt;That’s it for now. I’m happy to make more recommendations regarding software and hardware (feel free to tweet me @rdpeng), but I think what I’ve got here should get you 99% of the way there.&lt;/p&gt;&lt;img src=\"http://feeds.feedburner.com/~r/SimplyStatistics/~4/HJdqhW8Eu3U\" height=\"1\" width=\"1\" alt=\"\"/&gt;</description>\r\n    <feedburner:origLink>https://simplystatistics.org/2019/08/27/so-you-want-to-start-a-podcast/</feedburner:origLink></item>\r\n    \r\n    <item>\r\n      <title>The data deluge means no reasonable expectation of privacy - now what?</title>\r\n      <link>http://feedproxy.google.com/~r/SimplyStatistics/~3/KncmHv4k4Z4/</link>\r\n      <pubDate>Tue, 23 Jul 2019 00:00:00 +0000</pubDate>\r\n      \r\n      <guid isPermaLink=\"false\">https://simplystatistics.org/2019/07/23/the-data-deluge-means-no-reasonable-expectation-of-privacy-no-what/</guid>\r\n      <description>&lt;p&gt;Today a couple of different things reminded me about something that I suppose &lt;a href=\"https://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815\"&gt;many&lt;/a&gt; &lt;a href=\"https://www.nytimes.com/2019/04/28/opinion/fourth-amendment-privacy.html\"&gt;people&lt;/a&gt; &lt;a href=\"https://www.sciencenews.org/article/family-tree-dna-sharing-genetic-data-police-privacy\"&gt;are talking about&lt;/a&gt; but has been on my mind as well.&lt;/p&gt;\r\n\r\n&lt;p&gt;The idea is that many of our societies social norms are based on the reasonable expectation of privacy. But the reasonable expectation of privacy is increasingly a thing of the past. Three types of data I&amp;rsquo;ve been thinking about are:&lt;/p&gt;\r\n\r\n&lt;ul&gt;\r\n&lt;li&gt;&lt;strong&gt;Obviously identifying data&lt;/strong&gt;: Data like cellphone GPS traces and public social media posts are obviously information that is indentifiable and reduce privacy.&lt;/li&gt;\r\n&lt;li&gt;&lt;strong&gt;Data that can be inferred from public data&lt;/strong&gt;: We can also now infer a lot about people given the data that is public. For example a couple of years ago I challenged the students in my advanced data science class to predict the &lt;a href=\"https://bcrisktool.cancer.gov/\"&gt;Gail score&lt;/a&gt; - one of the most widely used measures of breast cancer risk  - using only the information available from a person&amp;rsquo;s public Facebook profile. While not all of the information was available, a good fraction of it was. This is an example of something you might not think that posting pictures of your family, your birthday celebrations, and family life events could enable. I was reminded of this when hearing about this &lt;a href=\"https://www.nature.com/articles/s41467-019-10933-3\"&gt;paper&lt;/a&gt; that claims to be able to deidentify up to 99.98\\% of Americans using only 15 pieces of demographic information.&lt;/li&gt;\r\n&lt;li&gt;&lt;strong&gt;Data other people share about us&lt;/strong&gt;: The stories around the &lt;a href=\"https://www.theatlantic.com/science/archive/2018/04/golden-state-killer-east-area-rapist-dna-genealogy/559070/\"&gt;capture of the Golden Gate Killer using genealogy data&lt;/a&gt; make it clear that even when you personally don&amp;rsquo;t share your data, someone else may be sharing it for you. The same can be said of photos of you that were tagged on Facebook even if you aren&amp;rsquo;t on the platform.&lt;/li&gt;\r\n&lt;/ul&gt;\r\n\r\n&lt;p&gt;I don&amp;rsquo;t think these types of data are going to magically disappear. So like a lot of other people I&amp;rsquo;ve been wondering how we should individually and as a society adapt to the world where privacy is no longer an expectation.&lt;/p&gt;&lt;img src=\"http://feeds.feedburner.com/~r/SimplyStatistics/~4/KncmHv4k4Z4\" height=\"1\" width=\"1\" alt=\"\"/&gt;</description>\r\n    <feedburner:origLink>https://simplystatistics.org/2019/07/23/the-data-deluge-means-no-reasonable-expectation-of-privacy-no-what/</feedburner:origLink></item>\r\n    \r\n    <item>\r\n      <title>More datasets for teaching data science: The expanded dslabs package</title>\r\n      <link>http://feedproxy.google.com/~r/SimplyStatistics/~3/zawn1oAM46c/</link>\r\n      <pubDate>Fri, 19 Jul 2019 00:00:00 +0000</pubDate>\r\n      \r\n      <guid isPermaLink=\"false\">https://simplystatistics.org/2019/07/19/more-datasets-for-teaching-data-science-the-expanded-dslabs-package/</guid>\r\n      <description>&lt;div id=\"introduction\" class=\"section level2\"&gt;\r\n&lt;h2&gt;Introduction&lt;/h2&gt;\r\n&lt;p&gt;We have expanded the &lt;a href=\"https://cran.r-project.org/web/packages/dslabs/index.html\"&gt;dslabs package&lt;/a&gt;, which we &lt;a href=\"https://simplystatistics.org/2018/01/22/the-dslabs-package-provides-datasets-for-teaching-data-science/\"&gt;previously introduced&lt;/a&gt; as a package containing realistic, interesting and approachable datasets that can be used in introductory data science courses.&lt;/p&gt;\r\n&lt;p&gt;This release adds 7 new datasets on climate change, astronomy, life expectancy, and breast cancer diagnosis. They are used in improved problem sets and new projects within the &lt;a href=\"https://www.edx.org/professional-certificate/harvardx-data-science\"&gt;HarvardX Data Science Professional Certificate Program&lt;/a&gt;, which teaches beginning R programming, data visualization, data wrangling, statistics, and machine learning for students with no prior coding background.&lt;/p&gt;\r\n&lt;p&gt;You can install the &lt;a href=\"https://cran.r-project.org/web/packages/dslabs/index.html\"&gt;dslabs package&lt;/a&gt; from CRAN:&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;install.packages(&amp;quot;dslabs&amp;quot;)&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;If you already have the package installed, you can add the new datasets by updating the package:&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;update.packages(&amp;quot;dslabs&amp;quot;)&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;You can load the package into your workspace normally:&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;library(dslabs)&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;Let’s preview these new datasets! To code along, use the following libraries and options:&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;# install packages if necessary\r\nif(!require(&amp;quot;tidyverse&amp;quot;)) install.packages(&amp;quot;tidyverse&amp;quot;)\r\nif(!require(&amp;quot;ggrepel&amp;quot;)) install.packages(&amp;quot;ggrepel&amp;quot;)\r\nif(!require(&amp;quot;matrixStats&amp;quot;)) install.packages(&amp;quot;matrixStats&amp;quot;)\r\n\r\n\r\n# load libraries\r\nlibrary(tidyverse)\r\nlibrary(ggrepel)\r\nlibrary(matrixStats)\r\n\r\n# set colorblind-friendly color palette\r\ncolorblind_palette &amp;lt;- c(&amp;quot;black&amp;quot;, &amp;quot;#E69F00&amp;quot;, &amp;quot;#56B4E9&amp;quot;, &amp;quot;#009E73&amp;quot;,\r\n                        &amp;quot;#CC79A7&amp;quot;, &amp;quot;#F0E442&amp;quot;, &amp;quot;#0072B2&amp;quot;, &amp;quot;#D55E00&amp;quot;)&lt;/code&gt;&lt;/pre&gt;\r\n&lt;/div&gt;\r\n&lt;div id=\"climate-change\" class=\"section level2\"&gt;\r\n&lt;h2&gt;Climate change&lt;/h2&gt;\r\n&lt;p&gt;Three datasets related to climate change are used to teach data visualization and data wrangling. These data produce clear plots that demonstrate an increase in temperature, greenhouse gas levels, and carbon emissions from 800,000 years ago to modern times. Students can create their own impactful visualizations with real atmospheric and ice core measurements.&lt;/p&gt;\r\n&lt;div id=\"modern-temperature-anomaly-and-carbon-dioxide-data-temp_carbon\" class=\"section level3\"&gt;\r\n&lt;h3&gt;Modern temperature anomaly and carbon dioxide data: &lt;code&gt;temp_carbon&lt;/code&gt;&lt;/h3&gt;\r\n&lt;p&gt;The &lt;code&gt;temp_carbon&lt;/code&gt; dataset includes annual global temperature anomaly measurements in degrees Celsius relative to the 20th century mean temperature from 1880-2018. The temperature anomalies over land and over ocean are reported also. In addition, it includes annual carbon emissions (in millions of metric tons) from 1751-2014. Temperature anomalies are from &lt;a href=\"https://www.ncdc.noaa.gov/cag/global/time-series\"&gt;NOAA&lt;/a&gt; and carbon emissions are from &lt;a href=\"https://cdiac.ess-dive.lbl.gov/trends/emis/tre_glob_2014.html\"&gt;Boden et al., 2017 via CDIAC&lt;/a&gt;.&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;data(temp_carbon)\r\n\r\n# line plot of annual global, land and ocean temperature anomalies since 1880\r\ntemp_carbon %&amp;gt;%\r\n    select(Year = year, Global = temp_anomaly, Land = land_anomaly, Ocean = ocean_anomaly) %&amp;gt;%\r\n    gather(Region, Temp_anomaly, Global:Ocean) %&amp;gt;%\r\n    ggplot(aes(Year, Temp_anomaly, col = Region)) +\r\n    geom_line(size = 1) +\r\n    geom_hline(aes(yintercept = 0), col = colorblind_palette[8], lty = 2) +\r\n    geom_label(aes(x = 2005, y = -.08), col = colorblind_palette[8], \r\n               label = &amp;quot;20th century mean&amp;quot;, size = 4) +\r\n    ylab(&amp;quot;Temperature anomaly (degrees C)&amp;quot;) +\r\n    xlim(c(1880, 2018)) +\r\n    scale_color_manual(values = colorblind_palette) +\r\n    ggtitle(&amp;quot;Temperature anomaly relative to 20th century mean, 1880-2018&amp;quot;)&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-07-19-more-datasets-for-teaching-data-science-the-expanded-dslabs-package_files/figure-html/unnamed-chunk-5-1.png\" width=\"672\" style=\"display: block; margin: auto;\" /&gt;&lt;/p&gt;\r\n&lt;/div&gt;\r\n&lt;div id=\"greenhouse-gas-concentrations-over-2000-years-greenhouse_gases\" class=\"section level3\"&gt;\r\n&lt;h3&gt;Greenhouse gas concentrations over 2000 years: &lt;code&gt;greenhouse_gases&lt;/code&gt;&lt;/h3&gt;\r\n&lt;p&gt;The &lt;code&gt;greenhouse_gases&lt;/code&gt; data frame contains carbon dioxide (&lt;span class=\"math inline\"&gt;\\(\\mbox{CO}_2\\)&lt;/span&gt;, ppm), methane (&lt;span class=\"math inline\"&gt;\\(\\mbox{CO}_2\\)&lt;/span&gt;, ppb) and nitrous oxide (&lt;span class=\"math inline\"&gt;\\(\\mbox{N}_2\\mbox{O}\\)&lt;/span&gt;, ppb) concentrations every 20 years from 0-2000 CE. The data are a subset of ice core measurements from &lt;a href=\"https://www.ncdc.noaa.gov/paleo-search/study/9959\"&gt;MacFarling Meure et al., 2006 via NOAA&lt;/a&gt;. There is a clear increase in all 3 gases starting around the time of the Industrial Revolution.&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;data(greenhouse_gases)\r\n\r\n# line plots of atmospheric concentrations of the three major greenhouse gases since 0 CE\r\ngreenhouse_gases %&amp;gt;%\r\n    ggplot(aes(year, concentration)) +\r\n    geom_line() +\r\n    facet_grid(gas ~ ., scales = &amp;quot;free&amp;quot;) +\r\n    xlab(&amp;quot;Year&amp;quot;) +\r\n    ylab(&amp;quot;Concentration (CH4/N2O ppb, CO2 ppm)&amp;quot;) +\r\n    ggtitle(&amp;quot;Atmospheric greenhouse gas concentration by year, 0-2000 CE&amp;quot;)&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-07-19-more-datasets-for-teaching-data-science-the-expanded-dslabs-package_files/figure-html/unnamed-chunk-6-1.png\" width=\"672\" style=\"display: block; margin: auto;\" /&gt;&lt;/p&gt;\r\n&lt;p&gt;Compare this pattern with manmade carbon emissions since 1751 from &lt;code&gt;temp_carbon&lt;/code&gt;, which have risen in a similar way:&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;# line plot of anthropogenic carbon emissions over 250+ years\r\ntemp_carbon %&amp;gt;%\r\n    ggplot(aes(year, carbon_emissions)) +\r\n    geom_line() +\r\n    xlab(&amp;quot;Year&amp;quot;) +\r\n    ylab(&amp;quot;Carbon emissions (metric tons)&amp;quot;) +\r\n    ggtitle(&amp;quot;Annual global carbon emissions, 1751-2014&amp;quot;)&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-07-19-more-datasets-for-teaching-data-science-the-expanded-dslabs-package_files/figure-html/unnamed-chunk-7-1.png\" width=\"672\" style=\"display: block; margin: auto;\" /&gt;&lt;/p&gt;\r\n&lt;/div&gt;\r\n&lt;div id=\"carbon-dioxide-levels-over-the-last-800000-years-historic_co2\" class=\"section level3\"&gt;\r\n&lt;h3&gt;Carbon dioxide levels over the last 800,000 years, &lt;code&gt;historic_co2&lt;/code&gt;&lt;/h3&gt;\r\n&lt;p&gt;A common argument against the existence of anthropogenic climate change is that the Earth naturally undergoes cycles of warming and cooling governed by natural changes beyond human control. &lt;span class=\"math inline\"&gt;\\(\\mbox{CO}_2\\)&lt;/span&gt; levels from ice cores and modern atmospheric measurements at the Mauna Loa observatory demonstrate that the speed and magnitude of natural variations in greenhouse gases pale in comparison to the rapid changes in modern industrial times. While the planet has been hotter and had higher &lt;span class=\"math inline\"&gt;\\(\\mbox{CO}_2\\)&lt;/span&gt; levels in the distant past (data not shown), the current unprecedented rate of change leaves little time for planetary systems to adapt.&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;data(historic_co2)\r\n\r\n# line plot of atmospheric CO2 concentration over 800K years, colored by data source\r\nhistoric_co2 %&amp;gt;%\r\n    ggplot(aes(year, co2, col = source)) +\r\n    geom_line() +\r\n    ylab(&amp;quot;CO2 (ppm)&amp;quot;) +\r\n    scale_color_manual(values = colorblind_palette[7:8]) +\r\n    ggtitle(&amp;quot;Atmospheric CO2 concentration, -800,000 BCE to today&amp;quot;)&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-07-19-more-datasets-for-teaching-data-science-the-expanded-dslabs-package_files/figure-html/unnamed-chunk-8-1.png\" width=\"672\" style=\"display: block; margin: auto;\" /&gt;&lt;/p&gt;\r\n&lt;/div&gt;\r\n&lt;/div&gt;\r\n&lt;div id=\"properties-of-stars-for-making-an-h-r-diagram-stars\" class=\"section level2\"&gt;\r\n&lt;h2&gt;Properties of stars for making an H-R diagram: &lt;code&gt;stars&lt;/code&gt;&lt;/h2&gt;\r\n&lt;p&gt;In astronomy, stars are classified by several key features, including temperature, spectral class (color) and luminosity (brightness). A common plot for demonstrating the different groups of stars and their propreties is the Hertzsprung-Russell diagram, or H-R diagram. The &lt;code&gt;stars&lt;/code&gt; data frame compiles information for making an H-R diagram with about approximately 100 named stars, including their temperature, spectral class and magnitude (which is inversely proportional to luminosity).&lt;/p&gt;\r\n&lt;p&gt;The H-R diagram has the hottest, brightest stars in the upper left and coldest, dimmest stars in the lower right. Main sequence stars are along the main diagonal, while giants are in the upper right and dwarfs are in the lower left. Several aspects of data visualization can be practiced with these data.&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;data(stars)\r\n\r\n# H-R diagram color-coded by spectral class\r\nstars %&amp;gt;%\r\n    mutate(type = factor(type, levels = c(&amp;quot;O&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;DB&amp;quot;, &amp;quot;A&amp;quot;, &amp;quot;DA&amp;quot;, &amp;quot;DF&amp;quot;, &amp;quot;F&amp;quot;, &amp;quot;G&amp;quot;, &amp;quot;K&amp;quot;, &amp;quot;M&amp;quot;)),\r\n           star = ifelse(star %in% c(&amp;quot;Sun&amp;quot;, &amp;quot;Polaris&amp;quot;, &amp;quot;Betelgeuse&amp;quot;, &amp;quot;Deneb&amp;quot;,\r\n                                     &amp;quot;Regulus&amp;quot;, &amp;quot;*SiriusB&amp;quot;, &amp;quot;Alnitak&amp;quot;, &amp;quot;*ProximaCentauri&amp;quot;),\r\n                         as.character(star), NA)) %&amp;gt;%\r\n    ggplot(aes(log10(temp), magnitude, col = type)) +\r\n    geom_point() +\r\n    geom_label_repel(aes(label = star)) +\r\n    scale_x_reverse() +\r\n    scale_y_reverse() +\r\n    xlab(&amp;quot;Temperature (log10 degrees K)&amp;quot;) +\r\n    ylab(&amp;quot;Magnitude&amp;quot;) +\r\n    labs(color = &amp;quot;Spectral class&amp;quot;) +\r\n    ggtitle(&amp;quot;H-R diagram of selected stars&amp;quot;)&lt;/code&gt;&lt;/pre&gt;\r\n&lt;pre&gt;&lt;code&gt;## Warning: Removed 88 rows containing missing values (geom_label_repel).&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-07-19-more-datasets-for-teaching-data-science-the-expanded-dslabs-package_files/figure-html/unnamed-chunk-9-1.png\" width=\"672\" style=\"display: block; margin: auto;\" /&gt;&lt;/p&gt;\r\n&lt;/div&gt;\r\n&lt;div id=\"united-states-period-life-tables-death_prob\" class=\"section level2\"&gt;\r\n&lt;h2&gt;United States period life tables: &lt;code&gt;death_prob&lt;/code&gt;&lt;/h2&gt;\r\n&lt;p&gt;Obtained from the &lt;a href=\"https://www.ssa.gov/oact/STATS/table4c6.html\"&gt;US Social Security Administration&lt;/a&gt;, the 2015 period life table lists the probability of death within one year at every age and for both sexes. These values are commonly used to calculate life insurance premiums. They can be used for exercises on probability and random variables. For example, the premiums can be calculated with a similar approach to that used for interest rates in this &lt;a href=\"https://rafalab.github.io/dsbook/random-variables.html#case-study-the-big-short\"&gt;case study on The Big Short&lt;/a&gt; in Rafael Irizarry’s &lt;a href=\"https://leanpub.com/datasciencebook\"&gt;Introduction to Data Science textbook&lt;/a&gt;.&lt;/p&gt;\r\n&lt;/div&gt;\r\n&lt;div id=\"brexit-polling-data-brexit_polls\" class=\"section level2\"&gt;\r\n&lt;h2&gt;Brexit polling data: &lt;code&gt;brexit_polls&lt;/code&gt;&lt;/h2&gt;\r\n&lt;p&gt;&lt;code&gt;brexit_polls&lt;/code&gt; contains vote percentages and spreads from the six months prior to the Brexit EU membership referendum in 2016 compiled from &lt;a href=\"https://en.wikipedia.org/w/index.php?title=Opinion_polling_for_the_United_Kingdom_European_Union_membership_referendum&amp;amp;oldid=896735054\"&gt;Wikipedia&lt;/a&gt;. These can be used to practice a variety of inference and modeling concepts, including confidence intervals, p-values, hierarchical models and forecasting.&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;data(brexit_polls)\r\n\r\n# plot of Brexit referendum polling spread between &amp;quot;Remain&amp;quot; and &amp;quot;Leave&amp;quot; over time\r\nbrexit_polls %&amp;gt;%\r\n    ggplot(aes(enddate, spread, color = poll_type)) +\r\n    geom_hline(aes(yintercept = -.038, color = &amp;quot;Actual spread&amp;quot;)) +\r\n    geom_smooth(method = &amp;quot;loess&amp;quot;, span = 0.4) +\r\n    geom_point() +\r\n    scale_color_manual(values = colorblind_palette[1:3]) +\r\n    xlab(&amp;quot;Poll end date (2016)&amp;quot;) +\r\n    ylab(&amp;quot;Spread (Proportion Remain - Proportion Leave)&amp;quot;) +\r\n    labs(color = &amp;quot;Poll type&amp;quot;) +\r\n    ggtitle(&amp;quot;Spread of Brexit referendum online and telephone polls&amp;quot;)&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-07-19-more-datasets-for-teaching-data-science-the-expanded-dslabs-package_files/figure-html/unnamed-chunk-10-1.png\" width=\"672\" style=\"display: block; margin: auto;\" /&gt;&lt;/p&gt;\r\n&lt;/div&gt;\r\n&lt;div id=\"breast-cancer-diagnosis-prediction-brca\" class=\"section level2\"&gt;\r\n&lt;h2&gt;Breast cancer diagnosis prediction: &lt;code&gt;brca&lt;/code&gt;&lt;/h2&gt;\r\n&lt;p&gt;This is the &lt;a href=\"https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29\"&gt;Breast Cancer Wisconsin (Diagnostic) Dataset&lt;/a&gt;, a classic machine learning dataset that allows classification of breast lesion biopsies as malignant or benign based on cell nucleus characteristics extracted from digitized images of fine needle aspirate cytology slides. The data are appropriate for principal component analysis and a variety of machine learning algorithms. Models can be trained to a predictive accuracy of over 95%.&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;# scale x values\r\nx_centered &amp;lt;- sweep(brca$x, 2, colMeans(brca$x))\r\nx_scaled &amp;lt;- sweep(x_centered, 2, colSds(brca$x), FUN = &amp;quot;/&amp;quot;)\r\n\r\n# principal component analysis\r\npca &amp;lt;- prcomp(x_scaled) \r\n\r\n# scatterplot of PC2 versus PC1 with an ellipse to show the cluster regions\r\ndata.frame(pca$x[,1:2], type = ifelse(brca$y == &amp;quot;B&amp;quot;, &amp;quot;Benign&amp;quot;, &amp;quot;Malignant&amp;quot;)) %&amp;gt;%\r\n    ggplot(aes(PC1, PC2, color = type)) +\r\n    geom_point() +\r\n    stat_ellipse() +\r\n    ggtitle(&amp;quot;PCA separates breast biospies into benign and malignant clusters&amp;quot;)&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-07-19-more-datasets-for-teaching-data-science-the-expanded-dslabs-package_files/figure-html/unnamed-chunk-11-1.png\" width=\"672\" style=\"display: block; margin: auto;\" /&gt;&lt;/p&gt;\r\n&lt;/div&gt;\r\n&lt;div id=\"conclusion\" class=\"section level2\"&gt;\r\n&lt;h2&gt;Conclusion&lt;/h2&gt;\r\n&lt;p&gt;We hope that these additional datasets make the &lt;a href=\"https://cran.r-project.org/web/packages/dslabs/index.html\"&gt;dslabs package&lt;/a&gt; even more useful for teaching data science through real-world case studies and motivating examples.&lt;/p&gt;\r\n&lt;p&gt;Are you an R programming novice but want to learn how to do all of this and more? Check out the &lt;a href=\"https://www.edx.org/professional-certificate/harvardx-data-science\"&gt;Data Science Professional Certificate Program from HarvardX&lt;/a&gt; on edX, taught by Rafael Irizarry!&lt;/p&gt;\r\n&lt;/div&gt;&lt;img src=\"http://feeds.feedburner.com/~r/SimplyStatistics/~4/zawn1oAM46c\" height=\"1\" width=\"1\" alt=\"\"/&gt;</description>\r\n    <feedburner:origLink>https://simplystatistics.org/2019/07/19/more-datasets-for-teaching-data-science-the-expanded-dslabs-package/</feedburner:origLink></item>\r\n    \r\n    <item>\r\n      <title>Research quality data and research quality databases</title>\r\n      <link>http://feedproxy.google.com/~r/SimplyStatistics/~3/oIIGzGzQVpU/</link>\r\n      <pubDate>Wed, 29 May 2019 00:00:00 +0000</pubDate>\r\n      \r\n      <guid isPermaLink=\"false\">https://simplystatistics.org/2019/05/29/research-quality-data-and-research-quality-databases/</guid>\r\n      <description>&lt;p&gt;When you are doing data science, you are doing research. You want to use data to answer a question, identify a new pattern, improve a current product, or come up with a new product. The common factor underlying each of these tasks is that you want to use the data to answer a question that you haven’t answered before. The most effective process we have come up for getting those answers is the scientific research process. That is why the key word in data science is not data, it is science.&lt;/p&gt;\r\n\r\n&lt;p&gt;No matter where you are doing data science - in academia, in a non-profit, or in a company - you are doing research. The data is the substrate you use to get the answers you care about. The first step most people take when using data is to collect the data and store it. This is a data engineering problem and is a necessary first step before you can do data science. But the state and quality of the data you have can make a huge amount of difference in how fast and accurately you can get answers. If the data is structured for analysis - if it is research quality  - then it makes getting answers dramatically faster.&lt;/p&gt;\r\n\r\n&lt;p&gt;A common analogy says that &lt;a href=\"https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data\"&gt;data is the new oil&lt;/a&gt;. Using this analogy pulling the data from all of the different available sources is like mining and extracting the oil. Putting it in a data lake or warehouse is like storing the crude oil for use in different products. In this analogy research is like getting the cars to go using the oil. Crude oil extracted from the ground can be used for a lot of different products, but to make it really useful for cars you need to refine the oil into gas. Creating research quality data is the way that you refine and structure data to make it conducive to doing science. It means that the data is no longer as general purpose, but it means you can use it much, much more efficiently for the purpose you care about - getting answers to your questions.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;em&gt;Research quality data&lt;/em&gt; is data that:&lt;/p&gt;\r\n\r\n&lt;ol&gt;\r\n&lt;li&gt;Is summarized the right amount&lt;/li&gt;\r\n&lt;li&gt;Is formatted to work with the tools you are going to use&lt;/li&gt;\r\n&lt;li&gt;Is easy to manipulate and use&lt;/li&gt;\r\n&lt;li&gt;Is valid and accurately reflects the underlying data collection&lt;/li&gt;\r\n&lt;li&gt;Has potential biases clearly documented.&lt;/li&gt;\r\n&lt;li&gt;Combines all the relevant data types you need to answer questions&lt;/li&gt;\r\n&lt;/ol&gt;\r\n\r\n&lt;p&gt;Let’s use an example to make this concrete. Suppose that you want to analyze data from an electronic health record. You want to do this to identify new potential efficiencies, find new therapies, and understand variation in prescribing within your medical system. The data that you have collected is in the form of billing records. They might be stored in a large database for a health system, where each record looks something like this:&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;img src=\"http://healthdesignchallenge.com/images/status-quo.png\" alt=\"\" /&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;em&gt;An example electronic health record. Source: &lt;a href=\"http://healthdesignchallenge.com/\"&gt;http://healthdesignchallenge.com/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;These data are collected incidentally during the health process and are designed for billing, not for research. Often they contain information about what treatments patients received and were billed for, but they might not include information on the health of the patient and whether they had any health complications or relapses they weren’t billed for.&lt;/p&gt;\r\n\r\n&lt;p&gt;These data are great, but they aren’t research grade. They aren’t summarized in any meaningful way, can’t be manipulated with visualization or machine learning tools, are unwieldy and contain a lot of information we don’t need, are subject to all sorts of strange sampling biases, and aren’t merged with any of the health outcome data you might care about.&lt;/p&gt;\r\n\r\n&lt;p&gt;So let’s talk about how we would turn this pile of crude data into research quality data.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;img src=\"https://user-images.githubusercontent.com/1571674/58572594-f77d2080-8209-11e9-87a2-0621a13eeb03.png\" alt=\"\" /&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;em&gt;Turning raw data into research quality data.&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;strong&gt;Summarizing the data the right amount&lt;/strong&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;To know how to summarize the data we need to know what are the most common types of questions we want to answer and what resolution we need to answer them. A good idea is to summarize things at the finest unit of analysis you think you will need - it is always easier to aggregate than disaggregate at the analysis level. So we might summarize at the patient and visit level. This would give us a data set where everything is indexed by patient and visit. If we want to answer something at a clinic, physician, or hospital level we can always aggregate there.&lt;/p&gt;\r\n\r\n&lt;p&gt;We also need to choose what to quantify. We might record for each visit the date, prescriptions with standardized codes, tests, and other metrics. Depending on the application we may store the free text of the physician notes as a text string - for potential later processing into specific tokens or words. Or if we already have a system for aggregating physicians notes we could apply it at this stage.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;strong&gt;Is formatted to work with the tools you are going to use&lt;/strong&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;Research quality data is organized so the most frequent tasks can be completed quickly and without large amounts of data processing and reformatting. Each data analytic tool has different requirements on the type of data you need to input. For example, many statistical modeling tools use “tidy data” so you might store the summarized data in a single tidy data set or a set of tidy data tables linked by a common set of indicators. Some software (for example in the analysis of human genomic data) require inputs in different formats - say as a set of objects in the R programming language. Others, like software to fit a convolutional neural network to a set of images, might require a set of image files organized in a directory in a particular way along with a metadata file providing information about each set of images. Or we might need to &lt;a href=\"https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/\"&gt;one hot encode&lt;/a&gt; categories that need to be classified.&lt;/p&gt;\r\n\r\n&lt;p&gt;In the case of our EHR data we might store everything in a set of tidy tables that can be used to quickly correlate different measurements. If we are going to integrate imaging, lab reports, and other documents we might store those in different formats to make integration with downstream tools easier.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;strong&gt;Is easy to manipulate and use&lt;/strong&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;This seems like it is just a re-hash of formatting the data to work with the tools you care about, but there are some subtle nuances. For example, if you have a huge amount of data (petabyes of images, for example) you might not want to do research on all of those data at once. It will be inefficient and expensive. So you might use sampling to get a smaller data set for your research quality data that is easier to use and manipulate. The data will also be easier to use if they are (a) stored in an easy to access database with security systems well documented, (b) have a data dictionary that makes it clear what the data are and where they come from, or &amp;copy; have a clear set of tutorials on how to perform common tasks on the data.&lt;/p&gt;\r\n\r\n&lt;p&gt;In our EHR example you might include a data dictionary that describes the dates of the data pull, the types of data pulled, the type of processing performed, and pointers to the scripts that pulled the data.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;strong&gt;Is valid and accurately reflects the underlying data collection&lt;/strong&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;Data can be invalid for a whole host of reasons. The data could be incorrectly formatted, input with error, could change over time, could be mislabeled, and more. All of these problems can occur on the original data pull or over time. Data can also be out of date as new data becomes available.&lt;/p&gt;\r\n\r\n&lt;p&gt;The research quality database should include only data that has been checked, validated, cleaned and QA’d so that it reflects the real state of the world. This process is not a one time effort, but an ongoing set of code, scripts, and processes that ensure the data you use for research are as accurate as possible.&lt;/p&gt;\r\n\r\n&lt;p&gt;In the EHR example there would be a series of data pulls, code to perform checks, and comparisons to additional data sources to validate the values, levels, variables, and other components of the research quality database.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;strong&gt;Has potential biases clearly documented&lt;/strong&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;A research quality data set is by definition a derived data set. So there is a danger that problems with the data will be glossed over since it has been processed and easy to use. To avoid this problem, there has to be documentation on where the data came from, what happened to them during processing, and any potential problems with the data.&lt;/p&gt;\r\n\r\n&lt;p&gt;With our EHR example this could include issues about how patients come into the system, what procedures can be billed (or not), what data was ignored in the research quality database, what are the time periods the data were collected, and more.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;strong&gt;Combines all the relevant data types you need to answer questions&lt;/strong&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;One big difference between a research quality data set/database and a raw database or even a general purpose tidy data set, is that it merges all of the relevant data you need to answer specific questions, even if they come from distinct sources. Research quality data pulls together and makes easy to access, all the information you need to answer your questions. This could still be in the form of a relational database - but the databases organization is driven by the research question, rather than driven by other purposes.&lt;/p&gt;\r\n\r\n&lt;p&gt;For example, EHR data may already be stored in a relational database. But it is stored in a way that makes it easy to understand billing and patient flow in a clinic. To answer a research question you might need to combine the billing data, with patient outcome data, and prescription fulfillment data, all processed and indexed so they are either already merged or can be easily merged.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;strong&gt;Why do this?&lt;/strong&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;So why build a research quality data set? It sure seems like a lot of work (and it is!). The reason is that this work will always be done, one way or the other. If you don&amp;rsquo;t invest in making a research quality data set up front, you will do it as a thousand papercuts over time. Each time you need to answer a new question or try a different model you&amp;rsquo;ll be slowed down by the friction of identifying, creating, and checking a new cleaned up data set. On the one hand this amortizes the work over the course of many projects. But by doing it piecemeal you also dramatically increase the chance of an error in processing, reduce answer time, slow down the research process, and make the investment for any individual project much higher.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;strong&gt;Problem Forward Data Science&lt;/strong&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;If you want help planning or building a research quality data set or database, we can help at &lt;a href=\"https://simplystatistics.org/2019/05/20/i-co-founded-a-company-meet-problem-forward-data-science/\"&gt;Problem Forward Data Science&lt;/a&gt;. Get in touch here: &lt;a href=\"https://problemforward.typeform.com/to/L4h89P\"&gt;https://problemforward.typeform.com/to/L4h89P&lt;/a&gt;&lt;/p&gt;&lt;img src=\"http://feeds.feedburner.com/~r/SimplyStatistics/~4/oIIGzGzQVpU\" height=\"1\" width=\"1\" alt=\"\"/&gt;</description>\r\n    <feedburner:origLink>https://simplystatistics.org/2019/05/29/research-quality-data-and-research-quality-databases/</feedburner:origLink></item>\r\n    \r\n    <item>\r\n      <title>I co-founded a company! Meet Problem Forward Data Science</title>\r\n      <link>http://feedproxy.google.com/~r/SimplyStatistics/~3/DD1vBM-g654/</link>\r\n      <pubDate>Mon, 20 May 2019 00:00:00 +0000</pubDate>\r\n      \r\n      <guid isPermaLink=\"false\">https://simplystatistics.org/2019/05/20/i-co-founded-a-company-meet-problem-forward-data-science/</guid>\r\n      <description>&lt;p&gt;I have some exciting news about something I&amp;rsquo;ve been working on for the last year or so. I started a company! It&amp;rsquo;s called &lt;a href=\"https://www.problemforward.com/\"&gt;Problem Forward&lt;/a&gt; data science.  I&amp;rsquo;m pumped about this new startup for a lot of reasons.&lt;/p&gt;\r\n\r\n&lt;ul&gt;\r\n&lt;li&gt;My co-founder is one of my families closest friends, Jamie McGovern, who has more than 2 decades of experience in the consulting world and who I&amp;rsquo;ve known for 15 years.&lt;/li&gt;\r\n&lt;li&gt;We are creating a cool new model of &amp;ldquo;data scientist as a service&amp;rdquo; (more on that below)&lt;/li&gt;\r\n&lt;li&gt;We have a &lt;a href=\"https://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/\"&gt;problem forward, not solution backward&lt;/a&gt; approach to data science that grew out of the Hopkins philosophy of data science.&lt;/li&gt;\r\n&lt;li&gt;We are headquartered in East Baltimore and are creating awesome new tech jobs in a place where they haven&amp;rsquo;t been historically.&lt;/li&gt;\r\n&lt;/ul&gt;\r\n\r\n&lt;p&gt;&lt;strong&gt;Problem Forward, Not Solution Backward&lt;/strong&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;We have always had a &amp;ldquo;problem forward, not solution backward&amp;rdquo; approach to statistics, machine learning and data here at Simply Stats. This has grown out of the Johns Hopkins Biostatistics philosophy of starting with the public health or medical problem you care about and working back to the statistical models, software, and tools you need to solve it.&lt;/p&gt;\r\n\r\n&lt;p&gt;This idea is so important to us, it is in the name of the company. When we work with people our first goal is to find out the problems and questions that they genuinely care about, then work backward to figure out how to solve them. We don&amp;rsquo;t come in with a particular predetermined algorithm or strategy. One of the first questions we ask people isn&amp;rsquo;t about data at all, it is:&lt;/p&gt;\r\n\r\n&lt;blockquote&gt;\r\n&lt;p&gt;What question do you wish you could answer about your business (ignoring if you have the data or not to answer it yet)?&lt;/p&gt;\r\n&lt;/blockquote&gt;\r\n\r\n&lt;p&gt;My favorite example of this is &lt;a href=\"https://en.wikipedia.org/wiki/Moneyball_(film)\"&gt;Moneyball&lt;/a&gt;. This is one of the classic stories about how the Oakland A&amp;rsquo;s used data to gain a unique advantage. But one of the key messages about this story that often gets missed is that the data weren&amp;rsquo;t unique to the A&amp;rsquo;s! Everyone had the same data, the A&amp;rsquo;s just started with a &lt;em&gt;problem&lt;/em&gt; that they needed to solve. They needed to find a unique way to win games that wasn&amp;rsquo;t as expensive. Then they moved forward to looking at the data and realized that on base percentage was cheaper than home runs. So the A&amp;rsquo;s used a &amp;ldquo;problem forward, not solution backward&amp;rdquo; approach to data analysis.&lt;/p&gt;\r\n\r\n&lt;p&gt;Using this approach we have worked with companies with a wide variety of needs. Our main capabilities are in data strategy, data cleaning and research quality database generation, modeling and machine learning, and data views through dashboards, reports, and presentations.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;img src=\"https://user-images.githubusercontent.com/1571674/58030374-6eb90300-7aec-11e9-8d3d-bf4ef225af61.png\" alt=\"\" /&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;strong&gt;Data Scientist as a Service&lt;/strong&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;There are a huge number of data science platform companies out there. Some of them are producing awesome tools, but as any serious data analyst will tell you we are years from automating real data science. We are only very recently seeing formal definitions of what &lt;a href=\"https://arxiv.org/abs/1904.11907\"&gt;success of a data analysis&lt;/a&gt; even means! So it isn&amp;rsquo;t surprising when general purpose platforms like IBM Watson &lt;a href=\"https://www.statnews.com/2017/09/05/watson-ibm-cancer/\"&gt;struggle with specific problems - the problem isn&amp;rsquo;t specified clearly enough for a platform to solve it yet.&lt;/a&gt;.&lt;/p&gt;\r\n\r\n&lt;p&gt;The reason there are so many platforms is that its easy to sell the &amp;ldquo;cool&amp;rdquo; part of the problem - say building an AI to classify images or drive a car. But often the deeper problem is (a) figuring out what you even want to or can say with a set of data set, (b) collecting a set of disorganized data, &amp;copy; getting buy in from groups with different motivations and data sets, (d) organizing ugly data from different sources or finding new data you might need, and (e) putting your answers in context. These problems are more like &amp;ldquo;glue&amp;rdquo; that comes between each of the platforms. We have a phrase we like to use:&lt;/p&gt;\r\n\r\n&lt;blockquote&gt;\r\n&lt;p&gt;To solve your data problem you need a person, not a platform&lt;/p&gt;\r\n&lt;/blockquote&gt;\r\n\r\n&lt;p&gt;So we have set up a &amp;ldquo;platform&amp;rdquo; that lets you scale up and down the number team members you have to solve data problems, just like you would scale up and down the number of servers or tools that you use on AWS.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;img src=\"https://user-images.githubusercontent.com/1571674/58030422-855f5a00-7aec-11e9-8d15-96074d74d7dd.png\" alt=\"\" /&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;This means if you are an early stage startup we can help you scale data science before you can afford to hire a whole team. Even if you are a non-profit or a small academic group we can scale up or down to suit your needs. And if you are a big company we can provide utility data science for projects with tight deadlines.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;strong&gt;Working with friends and building East Baltimore&lt;/strong&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;The thing that gets me most excited about this new adventure is working with my really close friend Jamie. It&amp;rsquo;s been huge for me to learn about the ins and outs of starting and running a business with someone who has decades of experience in the consulting industry.&lt;/p&gt;\r\n\r\n&lt;p&gt;It&amp;rsquo;s also exciting to be able to headquarter the company right in East Baltimore and to work to upskill and develop talent here in a neighborhood I care about.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;strong&gt;Like what you hear? Get in touch&lt;/strong&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;If you are looking for data science work we&amp;rsquo;d love to hear from you! Whether you are an academic, a non-profit, a small startup, or a big business our utility model means we can work with you.&lt;/p&gt;\r\n\r\n&lt;p&gt;If you are interested in working with us contact us here:&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;a href=\"https://problemforward.typeform.com/to/L4h89P\"&gt;https://problemforward.typeform.com/to/L4h89P&lt;/a&gt;&lt;/p&gt;&lt;img src=\"http://feeds.feedburner.com/~r/SimplyStatistics/~4/DD1vBM-g654\" height=\"1\" width=\"1\" alt=\"\"/&gt;</description>\r\n    <feedburner:origLink>https://simplystatistics.org/2019/05/20/i-co-founded-a-company-meet-problem-forward-data-science/</feedburner:origLink></item>\r\n    \r\n    <item>\r\n      <title>Generative and Analytical Models for Data Analysis</title>\r\n      <link>http://feedproxy.google.com/~r/SimplyStatistics/~3/ocd6QanN5iA/</link>\r\n      <pubDate>Mon, 29 Apr 2019 00:00:00 +0000</pubDate>\r\n      \r\n      <guid isPermaLink=\"false\">https://simplystatistics.org/2019/04/29/generative-and-analytical-models-for-data-analysis/</guid>\r\n      <description>&lt;p&gt;Describing how a data analysis is created is a topic of keen interest to me and there are a few different ways to think about it. Two different ways of thinking about data analysis are what I call the “generative” approach and the “analytical” approach. Another, more informal, way that I like to think about these approaches is as the “biological” model and the “physician” model. Reading through the literature on the process of data analysis, I’ve noticed that many seem to focus on the former rather than the latter and I think that presents an opportunity for new and interesting work.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"generative-model\"&gt;Generative Model&lt;/h2&gt;\r\n\r\n&lt;p&gt;The generative approach to thinking about data analysis focuses on the process by which an analysis is created. Developing an understanding of the decisions that are made to move from step one to step two to step three, etc. can help us recreate or reconstruct a data analysis. While reconstruction may not exactly be the goal of studying data analysis in this manner, having a better understanding of the process can open doors with respect to improving the process.&lt;/p&gt;\r\n\r\n&lt;p&gt;A key feature of the data analytic process is that it typically takes place inside the data analyst’s head, making it impossible to directly observe. Measurements can be taken by asking analysts what they were thinking at a given time, but that can be subject to a variety of measurement errors, as with any data that depend on a subject’s recall. In some situations, partial information is available, for example if the analyst writes down the thinking process through a series of reports or if a team is involved and there is a record of communication about the process. From this type of information, it is possible to gather a reasonable picture of “how things happen” and to describe the process for generating a data analysis.&lt;/p&gt;\r\n\r\n&lt;p&gt;This model is useful for understanding the “biological process”, i.e. the underlying mechanisms for how data analyses are created, sometimes referred to as &lt;a href=\"https://projecteuclid.org/euclid.ss/1009212754\"&gt;“statistical thinking”&lt;/a&gt;. There is no doubt that this process has inherent interest for both teaching purposes and for understanding applied work. But there is a key ingredient that is lacking and I will talk about that more below.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"analytical-model\"&gt;Analytical Model&lt;/h2&gt;\r\n\r\n&lt;p&gt;A second approach to thinking about data analysis ignores the underlying processes that serve to generate the data analysis and instead looks at the observable outputs of the analysis. Such outputs might be an R markdown document, a PDF report, or even a slide deck (Stephanie Hicks and I refer to this as the &lt;a href=\"https://arxiv.org/abs/1903.07639\"&gt;analytic container&lt;/a&gt;). The advantage of this approach is that the analytic outputs are real and can be directly observed. Of course, what an analyst puts into a report or a slide deck typically only represents a fraction of what might have been produced in the course of a full data analysis. However, it’s worth noting that the elements placed in the report are the &lt;em&gt;cumulative result&lt;/em&gt; of all the decisions made through the course of a data analysis.&lt;/p&gt;\r\n\r\n&lt;p&gt;I’ve used music theory as an analogy for data analysis &lt;a href=\"https://youtu.be/qFtJaq4TlqE\"&gt;many times before&lt;/a&gt;, mostly because&amp;hellip;it’s all I know, but also because it really works! When we listen to or examine a piece of music, we have essentially no knowledge of how that music came to be. We can no longer interview Mozart or Beethoven about how they wrote their music. And yet we are still able to do a few important things:&lt;/p&gt;\r\n\r\n&lt;ul&gt;\r\n&lt;li&gt;&lt;em&gt;Analyze and Theorize&lt;/em&gt;. We can analyze the music that we hear (and their written representation, if available) and talk about how different pieces of music differ from each other or share similarities. We might develop a sense of what is commonly done by a given composer, or across many composers, and evaluate what outputs are more successful or less successful. It’s even possible to draw connections between different kinds of music separated by centuries. None of this requires knowledge of the underlying processes.&lt;/li&gt;\r\n&lt;li&gt;&lt;em&gt;Give Feedback&lt;/em&gt;. When students are learning to compose music, an essential part of that training is the play the music in front of others. The audience can then give feedback about what worked and what didn’t. Occasionally, someone might ask “What were you thinking?” but for the most part, that isn’t necessary. If something is truly broken, it’s sometimes possible to prescribe some corrective action (e.g. “make this a C chord instead of a D chord”).&lt;/li&gt;\r\n&lt;/ul&gt;\r\n\r\n&lt;p&gt;There are even two whole podcasts dedicated to analyzing music&amp;mdash;&lt;a href=\"https://stickynotespodcast.libsyn.com\"&gt;Sticky Notes&lt;/a&gt; and &lt;a href=\"https://www.switchedonpop.com\"&gt;Switched on Pop&lt;/a&gt;&amp;mdash;and they generally do not interview the artists involved (this would be particularly hard for Sticky Notes). By contrast, the &lt;a href=\"http://songexploder.net\"&gt;Song Exploder&lt;/a&gt; podcast takes a more “generative approach” by having the artist talk about the creative process.&lt;/p&gt;\r\n\r\n&lt;p&gt;I referred to this analytical model for data analysis as the “physician” approach because it mirrors, in a basic sense, the problem that a physician confronts. When a patient arrives, there is a set of symptoms and the patient’s own report/history. Based on that information, the physician has to prescribe a course of action (usually, to collect more data). There is often little detailed understanding of the biological processes underlying a disease, but they physician may have a wealth of personal experience, as well as a literature of clinical trials comparing various treatments from which to draw. In human medicine, knowledge of biological processes is critical for designing new interventions, but may not play as large a role in prescribing specific treatments.&lt;/p&gt;\r\n\r\n&lt;p&gt;When I see a data analysis, as a teacher, a peer reviewer, or just a colleague down the hall, it is usually my job to give feedback in a timely manner. In such situations there usually isn’t time for extensive interviews about the development process of the analysis, even though that might in fact be useful. Rather, I need to make a judgment based on the observed outputs and perhaps some brief follow-up questions. To the extent that I can provide feedback that I think will improve the quality of the analysis, it is because I have a sense of what makes for a &lt;em&gt;successful&lt;/em&gt; analysis.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"the-missing-ingredient\"&gt;The Missing Ingredient&lt;/h2&gt;\r\n\r\n&lt;p&gt;Stephanie Hicks and I have discussed what are the elements of a data analysis as well as what might be the &lt;a href=\"https://arxiv.org/abs/1903.07639\"&gt;principles&lt;/a&gt; that guide the development of an analysis. In a &lt;a href=\"https://arxiv.org/abs/1904.11907\"&gt;new paper&lt;/a&gt;, we describe and characterize the &lt;em&gt;success&lt;/em&gt; of a data analysis, based on a matching of principles between the analyst and the audience. This is something I have touched on previously, both &lt;a href=\"https://simplystatistics.org/2018/04/17/what-is-a-successful-data-analysis/\"&gt;in this blog&lt;/a&gt; and on &lt;a href=\"http://nssdeviations.com/\"&gt;my podcast with Hilary Parker&lt;/a&gt;, but in a generally more hand-wavey fashion. Developing a more formal model, as Stephanie and I have done here, has been useful and has provided some additional insights.&lt;/p&gt;\r\n\r\n&lt;p&gt;For both the generative model and the analytical model of data analysis, the missing ingredient was a clear definition of what made a data analysis &lt;em&gt;successful&lt;/em&gt;. The other side of that coin, of course, is knowing when a data analysis has failed. The analytical approach is useful because it allows us to separate the analysis from the analyst and to categorize analyses according to their observed features. But the categorization is “unordered” unless we have some notion of success. Without a definition of success, we are unable to formally criticize analyses and explain our reasoning in a logical manner.&lt;/p&gt;\r\n\r\n&lt;p&gt;The generative approach is useful because it reveals potential targets of intervention, especially from a teaching perspective, in order to improve data analysis (just like understanding a biological process). However, without a concrete definition of success, we don’t have a target to strive for and we do not know how to intervene in order to make genuine improvement. In other words, there is no outcome on which we can “train our model” for data analysis.&lt;/p&gt;\r\n\r\n&lt;p&gt;I mentioned above that there is a lot of focus on developing the generative model for data analysis, but comparatively little work developing the analytical model. Yet, both models are fundamental to improving the quality of data analyses and learning from previous work. I think this presents an important opportunity for statisticians, data scientists, and others to study how we can characterize data analyses based on observed outputs and how we can draw connections between analyses.&lt;/p&gt;&lt;img src=\"http://feeds.feedburner.com/~r/SimplyStatistics/~4/ocd6QanN5iA\" height=\"1\" width=\"1\" alt=\"\"/&gt;</description>\r\n    <feedburner:origLink>https://simplystatistics.org/2019/04/29/generative-and-analytical-models-for-data-analysis/</feedburner:origLink></item>\r\n    \r\n    <item>\r\n      <title>Tukey, Design Thinking, and Better Questions</title>\r\n      <link>http://feedproxy.google.com/~r/SimplyStatistics/~3/P6I1o3CFvzQ/</link>\r\n      <pubDate>Wed, 17 Apr 2019 00:00:00 +0000</pubDate>\r\n      \r\n      <guid isPermaLink=\"false\">https://simplystatistics.org/2019/04/17/tukey-design-thinking-and-better-questions/</guid>\r\n      <description>&lt;p&gt;Roughly once a year, I read John Tukey’s paper &lt;a href=\"https://projecteuclid.org/euclid.aoms/1177704711\"&gt;“The Future of Data Analysis”&lt;/a&gt;, originally published in 1962 in the &lt;em&gt;Annals of Mathematical Statistics&lt;/em&gt;. I’ve been doing this for the past 17 years, each time hoping to really understand what it was he was talking about. Thankfully, each time I read it I seem to get &lt;em&gt;something&lt;/em&gt; new out of it. For example, in 2017 I wrote &lt;a href=\"https://youtu.be/qFtJaq4TlqE\"&gt;a whole talk&lt;/a&gt; around some of the basic ideas.&lt;/p&gt;\r\n&lt;p&gt;Well, it’s that time of year again, and I’ve been doing some reading.&lt;/p&gt;\r\n&lt;p&gt;Probably the most famous line from this paper is&lt;/p&gt;\r\n&lt;blockquote&gt;\r\n&lt;p&gt;Far better an approximate answer to the &lt;em&gt;right&lt;/em&gt; question, which is often vague, than an &lt;em&gt;exact&lt;/em&gt; answer to the wrong question, which can always be made precise.&lt;/p&gt;\r\n&lt;/blockquote&gt;\r\n&lt;p&gt;The underlying idea in this sentence arises in at least two ways in Tukey’s paper. First is his warning that statisticians should not be called upon to produce the “right” answers. He argues that the idea that statistics is a “monolithic, authoritarian structure designed to produce the ‘official’ results” presents a “real danger to data analysis”. Second, Tukey criticizes the idea that much of statistical practice centers around optimizing statistical methods around precise (and inadequate) criteria. One can feel free to identify a method that minimizes mean squared error, but that should not be viewed as the &lt;em&gt;goal&lt;/em&gt; of data analysis.&lt;/p&gt;\r\n&lt;p&gt;But that got me thinking—what &lt;em&gt;is&lt;/em&gt; the ultimate goal of data analysis? In 64 pages of writing, I’ve found it difficult to identify a sentence or two where Tukey describes the ultimate goal, why it is we’re bothering to analyze all this data. It occurred to me in this year’s reading of the paper, that maybe the reason Tukey’s writing about data analysis is often so confusing to me is because his goal is actually quite different from that of the rest of us.&lt;/p&gt;\r\n&lt;div id=\"more-questions-better-questions\" class=\"section level2\"&gt;\r\n&lt;h2&gt;More Questions, Better Questions&lt;/h2&gt;\r\n&lt;p&gt;Most of the time in data analysis, we are trying to answer a question with data. I don’t think it’s controversial to say that, but maybe that’s the wrong approach? Or maybe, that’s what we’re &lt;em&gt;not&lt;/em&gt; trying to do at first. Maybe what we spend most of our time doing is figuring out a better question.&lt;/p&gt;\r\n&lt;p&gt;Hilary Parker and I have discussed at length the idea of design thinking on &lt;a href=\"http://nssdeviations.com\"&gt;our podcast&lt;/a&gt;. One of the fundamental ideas from design thinking involves identifying the problem. It’s the first “diamond” in the &lt;a href=\"https://simplystatistics.org/2018/09/14/divergent-and-convergent-phases-of-data-analysis/\"&gt;“double diamond”&lt;/a&gt; approach to design.&lt;/p&gt;\r\n&lt;p&gt;Tukey describes the first three steps in a data analysis as:&lt;/p&gt;\r\n&lt;ol style=\"list-style-type: decimal\"&gt;\r\n&lt;li&gt;Recognition of problem&lt;/li&gt;\r\n&lt;li&gt;One technique used&lt;/li&gt;\r\n&lt;li&gt;Competing techniques used&lt;/li&gt;\r\n&lt;/ol&gt;\r\n&lt;p&gt;In other words, try one approach, then try a bunch of other approaches! You might be thinking, why not just try &lt;em&gt;the best&lt;/em&gt; approach (or perhaps the &lt;em&gt;right&lt;/em&gt; approach) and save yourself all that work? Well, that’s the kind of path you go down when you’re trying to answer the question. Stop doing that! There are two reasons why you should stop thinking about answering the question:&lt;/p&gt;\r\n&lt;ol style=\"list-style-type: decimal\"&gt;\r\n&lt;li&gt;You’re probably asking the wrong question anyway, so don’t take yourself too seriously;&lt;/li&gt;\r\n&lt;li&gt;The “best” approach is only defined as “best” according to some arbitrary criterion that probably isn’t suitable for your problem/question.&lt;/li&gt;\r\n&lt;/ol&gt;\r\n&lt;p&gt;After thinking about all this I was inspired to draw the following diagram.&lt;/p&gt;\r\n&lt;div class=\"figure\"&gt;\r\n&lt;img src=\"https://simplystatistics.org/post/2019-04-17-tukey-design-thinking-and-better-questions_files/question_evidence.png\" alt=\"Strength of Evidence vs. Quality of Question\" /&gt;\r\n&lt;p class=\"caption\"&gt;Strength of Evidence vs. Quality of Question&lt;/p&gt;\r\n&lt;/div&gt;\r\n&lt;p&gt;The goal in this picture is to get to the upper right corner, where you have a high quality question and very strong evidence. In my experience, most people assume that they are starting in the bottom right corner, where the quality of the question is at its highest. In that case, the only thing left to do is to choose the optimal procedure so that you can squeeze as much information out of your data. The reality is that we almost always start in the bottom left corner, with a vague and poorly defined question and a similarly vague sense of what procedure to use. In that case, what’s a data scientist to do?&lt;/p&gt;\r\n&lt;p&gt;In my view, the most useful thing a data scientist can do is to devote serious effort towards improving the quality and sharpness of the question being asked. On the diagram, the goal is to move us as much as possible to the right hand side. Along the way, we will look at data, we will consider things outside the data like context, resources and subject matter expertise, and we will try a bunch of different procedures (some optimal, some less so).&lt;/p&gt;\r\n&lt;p&gt;Ultimately, we will develop some of idea of what the data tell us, but more importantly we will have a better sense of what kinds of questions we can ask of the data and what kinds of questions we actually want to have answered. In other words, we can learn more about ourselves by looking at the data.&lt;/p&gt;\r\n&lt;/div&gt;\r\n&lt;div id=\"exploring-the-data\" class=\"section level2\"&gt;\r\n&lt;h2&gt;Exploring the Data&lt;/h2&gt;\r\n&lt;p&gt;It would seem that the message here is that the goal of data analysis is to explore the data. In other words, data analysis &lt;em&gt;is&lt;/em&gt; exploratory data analysis. Maybe this shouldn’t be so surprising given that Tukey &lt;a href=\"https://en.wikipedia.org/wiki/Exploratory_data_analysis\"&gt;wrote the book&lt;/a&gt; on exploratory data analysis. In this paper, at least, he essentially dismisses other goals as overly optimistic or not really meaningful.&lt;/p&gt;\r\n&lt;p&gt;For the most part I agree with that sentiment, in the sense that looking for “the answer” in a single set of data is going to result in disappointment. At best, you will accumulate evidence that will point you in a new and promising direction. Then you can iterate, perhaps by collecting new data, or by asking different questions. At worst, you will conclude that you’ve “figured it out” and then be shocked when someone else, looking at another dataset, concludes something completely different. In light of this, discussions about p-values and statistical significance are very much beside the point.&lt;/p&gt;\r\n&lt;p&gt;The following is from the very opening of Tukey’s book *Exploratory Data Analysis:&lt;/p&gt;\r\n&lt;blockquote&gt;\r\n&lt;p&gt;It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it&lt;/p&gt;\r\n&lt;/blockquote&gt;\r\n&lt;p&gt;(Note that the all caps are originally his!) Given this, it’s not too surprising that Tukey seems to equate exploratory data analysis with essentially all of data analysis.&lt;/p&gt;\r\n&lt;/div&gt;\r\n&lt;div id=\"better-questions\" class=\"section level2\"&gt;\r\n&lt;h2&gt;Better Questions&lt;/h2&gt;\r\n&lt;p&gt;There’s one story that, for me, totally captures the spirit of exploratory data analysis. Legend has it that Tukey once asked a student what were the benefits of the &lt;a href=\"https://en.wikipedia.org/wiki/Median_polish\"&gt;median polish technique&lt;/a&gt;, a technique he invented to analyze two-way tabular data. The student dutifully answered that the benefit of the technique is that it provided summaries of the rows and columns via the row- and column-medians. In other words, like any good statistical technique, it &lt;em&gt;summarized the data&lt;/em&gt; by reducing it in some way. Tukey fired back, saying that this was incorrect—the benefit was that the technique created &lt;em&gt;more data&lt;/em&gt;. That “more data” was the residuals that are leftover in the table itself after running the median polish. It is the residuals that really let you learn about the data, discover whether there is anything unusual, whether your question is well-formulated, and how you might move on to the next step. So in the end, you got row medians, column medians, &lt;em&gt;and&lt;/em&gt; residuals, i.e. more data.&lt;/p&gt;\r\n&lt;p&gt;If a good exploratory technique gives you more data, then maybe good exploratory data analysis gives you more questions, or &lt;em&gt;better&lt;/em&gt; questions. More refined, more focused, and with a sharper point. The benefit of developing a sharper question is that it has a greater potential to provide discriminating information. With a vague question, the best you can hope for is a vague answer that may not lead to any useful decisions. Exploratory data analysis (or maybe just &lt;em&gt;data analysis&lt;/em&gt;) gives you the tools that let the data guide you towards a better question.&lt;/p&gt;\r\n&lt;/div&gt;&lt;img src=\"http://feeds.feedburner.com/~r/SimplyStatistics/~4/P6I1o3CFvzQ\" height=\"1\" width=\"1\" alt=\"\"/&gt;</description>\r\n    <feedburner:origLink>https://simplystatistics.org/2019/04/17/tukey-design-thinking-and-better-questions/</feedburner:origLink></item>\r\n    \r\n    <item>\r\n      <title>Interview with Abhi Datta</title>\r\n      <link>http://feedproxy.google.com/~r/SimplyStatistics/~3/Qt8ZKqgv9C4/</link>\r\n      <pubDate>Mon, 01 Apr 2019 00:00:00 +0000</pubDate>\r\n      \r\n      <guid isPermaLink=\"false\">https://simplystatistics.org/2019/04/01/interview-with-abhi-datta/</guid>\r\n      <description>&lt;p&gt;&lt;em&gt;Editor’s note: This is the next in our series of interviews with early career statisticians and data scientists. Today we are talking to Abhi Datta about his work in large scale spatial analysis and his interest in soccer! Follow him on Twitter at &lt;a href=\"https://twitter.com/datta_science\"&gt;@datta_science&lt;/a&gt;. If you have recommendations of an (early career) person in academics or industry you would like to see promoted, reach out to Jeff (@jtleek) on Twitter!&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;em&gt;SS: Do you consider yourself a statistician, biostatistician, data scientist, or something else?&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;AD: That is a difficult question for me, as I enjoy working on theory, methods and data analysis and have co-authored diverse papers ranging from theoretical expositions to being primarily centered around a complex data analysis. My research interests also span a wide range of areas. A lot of my work on spatial statistics is driven by applications in environmental health and air pollution. Another significant area of my research is developing Bayesian models for epidemiological applications using survey data.&lt;/p&gt;\r\n\r\n&lt;p&gt;I would say what I enjoy most is developing statistical methodology motivated by a complex application where current methods fall short, applying the method for analysis of the motivating data, and trying to see if it is possible to establish some guarantees about the method through a combination of theoretical studies and empirical experiments that will help to generalize applicability of the method for other datasets. Of course, not all projects involve all the steps, but that is my ideal workflow. Not sure what that classifies me as.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;em&gt;SS: How did you get into statistics? What was your path to ending up at Hopkins?&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;AD: I was born and grew up in Kolkata, India. I had the option of going for engineering, medical or statistics undergrad. I chose statistics persuaded by my appreciation for mathematics and the reputation of the statistics program at Indian Statistical Institute (ISI), Kolkata. I completed my undergrad (BStat) and Masters (MStat) in Statistics from ISI and I’m thankful I made that choice as those 5 years at ISI played a pivotal role in my life. Besides getting rigorous training in the foundations of statistics, most importantly, I met my wife &lt;a href=\"https://twitter.com/DrDebashreeRay\"&gt;Dr. Debashree Ray&lt;/a&gt; at ISI.&lt;/p&gt;\r\n\r\n&lt;p&gt;After my Masters, I had a brief stint in the finance industry, working for 2 years at Morgan Stanley (in Mumbai and then in New York City) before I joined the PhD program at the Division of Biostatistics at University of Minnesota (UMN) in 2012 where Debashree was pursuing her PhD in Biostatistics. I had initially planned to work in Statistical Genetics as I had done a research project in that area in my Master’s. However, I explored other research areas in my first year and ended up working on spatial statistics under the supervision of my advisor Dr. &lt;a href=\"http://sudipto.bol.ucla.edu/\"&gt;Sudipto Banerjee&lt;/a&gt;, and on high-dimensional data with my co-advisor&lt;a href=\"http://users.stat.umn.edu/~zouxx019/\"&gt;Dr. Hui Zou&lt;/a&gt; from the Department of Statistics in Minnesota. I graduated from Minnesota in 2016 and joined Hopkins Biostat as an Assistant Professor in the Fall of 2016.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;em&gt;SS: You work on large scale spatio-temporal modeling - how do you speed up computations for the bootstrap when the data are very large?&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;AD: A main computational roadblock in spatio-temporal statistics is working with very big covariance matrices that strain memory and computing resources typically available in personal computers. &lt;a href=\"https://amstat.tandfonline.com/doi/abs/10.1080/01621459.2015.1044091\"&gt;Previously&lt;/a&gt;, I have developed nearest neighbor Gaussian Processes (NNGP) &amp;ndash; a Bayesian hierarchical model for inference in massive geospatial datasets. One issue with hierarchical Bayesian models is their reliance on long sequential MCMC runs. Bootstrap, unlike MCMC, can be implemented in an embarrassingly parallel fashion. However, for geospatial data, all observations are correlated across space prohibiting direct resampling for bootstrap.&lt;/p&gt;\r\n\r\n&lt;p&gt;In a &lt;a href=\"https://onlinelibrary.wiley.com/doi/abs/10.1002/sta4.184\"&gt;recent work&lt;/a&gt; with my student Arkajyoti Saha, we proposed a semi-parametric bootstrap for inference on large spatial covariance matrices. We use sparse Cholesky factors of spatial covariance matrices to approximately decorrelate the data before resampling for bootstrap. Arkajyoti has implemented this in an R-package &lt;a href=\"https://cran.r-project.org/web/packages/BRISC/index.html\"&gt;BRISC: Bootstrap for rapid inference on spatial covariances&lt;/a&gt;. BRISC is extremely fast and at the time of publication, to my knowledge, it was the only R-package that offered inference on all the spatial covariance parameters without using MCMC. The package can also be used simply for super-fast estimation and prediction in geo-statistics.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;em&gt;SS: You have a cool paper on mapping local and global trait variation in plant distributions, how did you get involved in that collaboration? Does your modeling have implications for people studying the impacts of climate change?&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;AD: In my final year of PhD at UMN, I was awarded the Inter-Disciplinary Doctoral Fellowship &amp;ndash;  a fantastic initiative by the graduate school at UMN providing research and travel funding, and office space to work with an inter-disciplinary team of researchers on a collaborative project. In  my IDF, mentored by &lt;a href=\"https://www-users.cs.umn.edu/~baner029/\"&gt;Dr. Arindam Banerjee&lt;/a&gt; and &lt;a href=\"https://www.forestry.umn.edu/people/peter-b-reich\"&gt;Dr. Peter Reich&lt;/a&gt;, I worked with a group of climate modelers, ecologists and computer scientists from several institutions on a project whose eventual goal is to improve carbon projections from climate models.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;a href=\"https://www.pnas.org/content/114/51/E10937\"&gt;The paper&lt;/a&gt; you mention was aimed at improving the global characterization of plant traits (measurements). This is important as plant trait values are critical inputs to climate model. Even the largest plant trait database TRY offers poor geographical coverage with little or no data across many large geographical regions. We used the fast NNGP approach I had been developing in my PhD to spatially gap-fill the plant trait data to create a global map of important plant traits with proper uncertainty quantification. The collaboration was a great learning experience for me on how to conduct a complex data analysis, and how to communicate with scientists.&lt;/p&gt;\r\n\r\n&lt;p&gt;Currently, we are looking at ways to incorporate the uncertainty quantified trait values as inputs to Earth System Models (ESMs) – the land component of climate models. We hope that replacing single trait values with entire trait distributions as inputs to these models will help to better propagate the uncertainty and improve the final model projections.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;em&gt;SS: What project has you most excited at the moment?&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;AD: There are two. I have been working with &lt;a href=\"https://twitter.com/ScottZeger\"&gt;Dr. Scott Zeger&lt;/a&gt; on a project lead by &lt;a href=\"https://www.jhsph.edu/faculty/directory/profile/2047/agbessi-amouzou\"&gt;Dr. Agbessi Amouzou&lt;/a&gt; in the Department of International Health at Hopkins aiming to estimate the cause-specific fractions (CSMF) of child mortality in Mozambique using family questionnaire data (verbal autopsy). Verbal autopsies are often used as a surrogate to full autopsy in many countries and there exists software that use these questionnaire data to predict a cause for every death. However, these software are usually trained on some standard training data and yield inaccurate predictions in local context. This problem is a special case of transfer learning where a model trained using data representing a standard population offers poor predictive accuracy when specific populations are of interest. We have developed a general approach for transfer learning of classifiers that uses the predictions from these verbal autopsy software and limited full autopsy data from the local population to provide improved estimates of cause-specific mortality fractions. The approach is very general and offers a parsimonious model-based solution to transfer learning and can be used in any other classification-based application.&lt;/p&gt;\r\n\r\n&lt;p&gt;The second project involves creating high-resolution space-time maps of particulate matter (PM2.5) in Baltimore. Currently a network of low-cost air pollution monitors is being deployed in Baltimore that promises to offer air pollution measurements at a much higher geospatial resolution than what is provided by EPA’s sparse regulatory monitoring network. I was awarded a Bloomberg American Health Initiative Spark award for working with &lt;a href=\"https://www.jhsph.edu/faculty/directory/profile/2928/kirsten-koehler\"&gt;Dr. Kirsten Koehler&lt;/a&gt; in the Department of Environmental Health and Engineering to combine the low-cost network data, the sparse EPA data and other land-use covariates to create uncertainty quantified maps of PM2.5 at an unprecedented spatial resolution. We have just started analyzing the first two months of data and I’m really looking forward to help create the end-product and understand how PM2.5 levels vary across the different neighborhoods in Baltimore.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;em&gt;SS: You have an interest in soccer and spatio temporal models have played an increasing role in soccer analytics. Have you thought about using your statistics skills to study soccer or do you try to avoid mixing professional work and being a fan?&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;AD: Yes, I’m an avid soccer fan. I have travelled to Brazil in 2014 and Russia in 2018 to watch live games in the world cups. It also unfortunately means that I set my alarm to earlier times on weekends than on weekdays as the European league games start pretty early in US time.&lt;/p&gt;\r\n\r\n&lt;p&gt;However, until recent times, I’ve been largely ignorant of applications of spatio-temporal statistics in soccer analytics. I just finished teaching a Spatial Statistics course and one of the students presented a &lt;a href=\"https://arxiv.org/abs/1702.05662\"&gt;fascinating work&lt;/a&gt; he has done on predicting player’s scoring abilities using spatial statistics. I certainly plan to read more literature on this and maybe one day can contribute. Till then I remain a fan.&lt;/p&gt;&lt;img src=\"http://feeds.feedburner.com/~r/SimplyStatistics/~4/Qt8ZKqgv9C4\" height=\"1\" width=\"1\" alt=\"\"/&gt;</description>\r\n    <feedburner:origLink>https://simplystatistics.org/2019/04/01/interview-with-abhi-datta/</feedburner:origLink></item>\r\n    \r\n    <item>\r\n      <title>10 things R can do that might surprise you</title>\r\n      <link>http://feedproxy.google.com/~r/SimplyStatistics/~3/Ho12mGatrgY/</link>\r\n      <pubDate>Wed, 13 Mar 2019 00:00:00 +0000</pubDate>\r\n      \r\n      <guid isPermaLink=\"false\">https://simplystatistics.org/2019/03/13/10-things-r-can-do-that-might-surprise-you/</guid>\r\n      <description>&lt;p&gt;Over the last few weeks I&amp;rsquo;ve had a couple of interactions with folks from the computer science world who were pretty disparaging of the R programming language. A lot of the critism focused on perceived limitations of R to statistical analysis.&lt;/p&gt;\r\n\r\n&lt;p&gt;It&amp;rsquo;s true, R does have a hugely comprehensive list of analysis packages on &lt;a href=\"https://cran.r-project.org/\"&gt;CRAN&lt;/a&gt;, &lt;a href=\"http://bioconductor.org/\"&gt;Bioconductor&lt;/a&gt;, &lt;a href=\"https://neuroconductor.org/\"&gt;Neuroconductor&lt;/a&gt;, and &lt;a href=\"https://ropensci.org/\"&gt;ROpenSci&lt;/a&gt; as well as great package management. As I was having these conversations I realized that R has grown into a multi-purpose connective language for things beyond just data analysis. But that the functionality isn&amp;rsquo;t always as well known outside of the R community. So this post is about some of the ridiculously awesome features of R that may or may not be as widely known. Here are 10 things R can do you might not have known about, building on &lt;a href=\"https://twitter.com/kara_woo/status/1100908125396193281\"&gt;Kara&amp;rsquo;s great tweet thread about lighthearted things to do with R&lt;/a&gt;.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"1-you-can-write-reproducible-word-or-powerpoint-documents-from-r-markdown\"&gt;1. You can write reproducible Word or Powerpoint documents from R markdown&lt;/h2&gt;\r\n\r\n&lt;p&gt;The &lt;a href=\"https://rmarkdown.rstudio.com/\"&gt;rmarkdown&lt;/a&gt; package lets you &lt;a href=\"https://bookdown.org/yihui/rmarkdown/word-document.html\"&gt;create reproducible Word documents&lt;/a&gt; and &lt;a href=\"https://bookdown.org/yihui/rmarkdown/powerpoint-presentation.html\"&gt;reproducible Powerpoint Presentations&lt;/a&gt; from your R markdown code just by changing one line in the YAML!&lt;/p&gt;\r\n\r\n&lt;h2 id=\"2-you-can-build-and-host-interactive-web-apps-in-just-a-few-lines-of-code\"&gt;2. You can build and host interactive web apps in just a few lines of code&lt;/h2&gt;\r\n\r\n&lt;p&gt;In just a few lines of code you can create interactive web apps in R. For example, &lt;a href=\"https://gist.github.com/jtleek/3e1baac9a74ea81556c9e6d55743d7ea\"&gt;in just 36 lines of code&lt;/a&gt; you can create an interactive dashboard to explore your BMI in relation to the NHANES sample using the &lt;a href=\"https://rmarkdown.rstudio.com/flexdashboard/\"&gt;flexdashboard&lt;/a&gt; package.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"3-you-can-host-your-web-apps-in-one-more-line-of-r-code\"&gt;3. You can host your web apps in one more line of R code&lt;/h2&gt;\r\n\r\n&lt;p&gt;The other cool thing about building web apps in R is that you can get them up on the web with just another line or two of R code using the &lt;a href=\"https://cran.r-project.org/web/packages/rsconnect/index.html\"&gt;rsconnect&lt;/a&gt; package. You can put them up on your own server or, even easier, host them on a cloud server like &lt;a href=\"https://www.shinyapps.io/\"&gt;shinyapps.io&lt;/a&gt;.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"4-you-can-connect-to-almost-any-database-under-the-sun-and-pull-data-with-dplyr-dbplyr\"&gt;4. You can connect to almost any database under the sun and pull data with dplyr/dbplyr&lt;/h2&gt;\r\n\r\n&lt;p&gt;It is really easy to connect to almost any database (local or remote) using the &lt;a href=\"https://cran.r-project.org/web/packages/dbplyr/index.html\"&gt;dbplyr&lt;/a&gt; package. This makes it possible for an R user to work independently pulling data from &lt;a href=\"https://cfss.uchicago.edu/distrib001_database.html\"&gt;almost all common database types&lt;/a&gt;. You can also use specialized packages like &lt;a href=\"https://cran.r-project.org/web/packages/bigrquery/index.html\"&gt;bigrquery&lt;/a&gt; to work directly with BigQuery and other high performance data stores.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"5-you-can-use-the-same-dplyr-grammar-locally-or-on-data-on-multiple-different-data-stores\"&gt;5. You can use the same dplyr grammar locally or on data on multiple different data stores&lt;/h2&gt;\r\n\r\n&lt;p&gt;Once you learn how to do basic data tranforms with &lt;a href=\"https://dplyr.tidyverse.org/\"&gt;dplyr&lt;/a&gt;, you can apply the same code to analyze data locally on your computer or remotely on any of the above databases or data stores. This simplifies and unifies data manipulation across multiple different databases and languages.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"6-you-can-fit-deep-learning-models-with-keras-and-tensorflow\"&gt;6. You can fit deep learning models with keras and Tensorflow&lt;/h2&gt;\r\n\r\n&lt;p&gt;The &lt;a href=\"https://keras.rstudio.com/\"&gt;keras&lt;/a&gt; package allows you to fit both pre-trained and denovo deep learning models directly from R. You can also work with the direct &lt;a href=\"https://tensorflow.rstudio.com/\"&gt;TensorFlow&lt;/a&gt; interface to fit the same kind of models.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"7-you-can-build-apis-and-serve-them-from-r\"&gt;7. You can build APIs and serve them from R&lt;/h2&gt;\r\n\r\n&lt;p&gt;The &lt;a href=\"https://www.rplumber.io/\"&gt;plumbr&lt;/a&gt; R package lets you convert R functions to web APIs that can be integrated into downstream applications. If you have Rstudio Connect you can also &lt;a href=\"https://blog.rstudio.com/2017/08/03/rstudio-connect-v1-5-4-plumber/\"&gt;deploy them as easily&lt;/a&gt; as you deploy web apps.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"8-you-can-make-video-game-interfaces-with-r\"&gt;8. You can make video game interfaces with R&lt;/h2&gt;\r\n\r\n&lt;p&gt;Not only can you deploy web apps, you can make them into awesome video games in R. The &lt;a href=\"https://github.com/ColinFay/nessy\"&gt;nessy&lt;/a&gt; package lets you create NES looking Shiny apps and &lt;a href=\"https://lucy.shinyapps.io/classify/\"&gt;deploy them&lt;/a&gt; just like you would any other Shiny app.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"9-you-can-analyze-data-using-spark-clusters-right-from-r\"&gt;9. You can analyze data using Spark clusters right from R&lt;/h2&gt;\r\n\r\n&lt;p&gt;Want to fit big, gnarly machine learning models on huge data sets? You can do that right from R using the &lt;a href=\"https://spark.rstudio.com/\"&gt;sparklyr&lt;/a&gt; package. You can use spark on your Desktop or a monster Spark cluster.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"10-you-can-build-and-learn-r-interactively-in-r\"&gt;10. You can build and learn R interactively in R&lt;/h2&gt;\r\n\r\n&lt;p&gt;The &lt;a href=\"https://swirlstats.com/\"&gt;swirl&lt;/a&gt; package is an R package that lets you build interactive tutorials for R, right inside R.&lt;/p&gt;\r\n\r\n&lt;p&gt;This is by no means a comprehensive list. You can also connect to AWS Polly and write &lt;a href=\"https://github.com/seankross/ari\"&gt;text to speech synthesis&lt;/a&gt; software or build Shiny apps that &lt;a href=\"https://yihui.shinyapps.io/voice/\"&gt;respond to voice commands&lt;/a&gt; or build apps that let you combine deep learning and accelerometry data to cast &lt;a href=\"https://jhubiostatistics.shinyapps.io/cast_spells/\"&gt;Harry Potter spells&lt;/a&gt;. The point is that R has become much more than just a data analysis language (although its still good at that!) and being good at R opens the door to lots of practical and cool applications.&lt;/p&gt;&lt;img src=\"http://feeds.feedburner.com/~r/SimplyStatistics/~4/Ho12mGatrgY\" height=\"1\" width=\"1\" alt=\"\"/&gt;</description>\r\n    <feedburner:origLink>https://simplystatistics.org/2019/03/13/10-things-r-can-do-that-might-surprise-you/</feedburner:origLink></item>\r\n    \r\n    <item>\r\n      <title>Open letter to journal editors: dynamite plots must die</title>\r\n      <link>http://feedproxy.google.com/~r/SimplyStatistics/~3/uhTy3UbMzZA/</link>\r\n      <pubDate>Thu, 21 Feb 2019 00:00:00 +0000</pubDate>\r\n      \r\n      <guid isPermaLink=\"false\">https://simplystatistics.org/2019/02/21/dynamite-plots-must-die/</guid>\r\n      <description>&lt;p&gt;Statisticians have been pointing out the problem with dynamite plots, also known as bar and line graphs, for years. Karl Broman lists them as one of the &lt;a href=\"https://www.biostat.wisc.edu/~kbroman/topten_worstgraphs/\"&gt;top ten worst graphs&lt;/a&gt;. The problem has even been documented in the peer reviewed literature. For example, &lt;a href=\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3087125/\"&gt;this British Journal of Pharmacology&lt;/a&gt; paper titled &lt;em&gt;Show the data, don’t conceal them&lt;/em&gt; was published in 2011.&lt;/p&gt;\r\n&lt;p&gt;However, despite all these efforts, dynamite plots continue to be ubiquitous in the scientific literature. Just open the latest issue of Nature, Science or Cell and you will likely see a few. In fact, in this &lt;a href=\"https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002128\"&gt;PLOS Biology paper&lt;/a&gt;, Tracey Weissgerber and co-authors perform a systmetic review of “top physiology journals” and find that “85.6% of papers included at least one bar graph”. They go on to recommend “training investigators in data presentation, encouraging a more complete presentation of data, and changing journal editorial policies”. In my view, the training will be accelerated if editors implement a policy that requires authors to show the data or, if the dataset is too large, show the distribution of the data with boxplots, histograms or smooth density estimates.&lt;/p&gt;\r\n&lt;div id=\"whats-wrong-with-dynamite-plots\" class=\"section level2\"&gt;\r\n&lt;h2&gt;What’s wrong with dynamite plots&lt;/h2&gt;\r\n&lt;p&gt;Dynamite plots are used to compare measurements from two or more groups: cases and controls, for example. In a two group comparison, the plots are graphical representations of a grand total of 4 numbers, regardless of the sample size. The four numbers are the average and the standard error (or the standard deviation, it’s not always clear) for each group. Here is a simulated example comparing diastolic blood pressure for patients on a drug and placebo:&lt;/p&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-02-21-dynamite-plots-must-die_files/figure-html/dynamite-1.png\" width=\"384\" /&gt;&lt;/p&gt;\r\n&lt;p&gt;Stars are often added to point out that the differences are statistically significant.&lt;/p&gt;\r\n&lt;p&gt;So what is the problem with these plots? First, if you have a print edition of your journal you are wasting ink. No need to waste all that toner just to show these four summaries:&lt;/p&gt;\r\n&lt;pre&gt;&lt;code&gt;##          x average  se\r\n## 1 Controls      60 2.3\r\n## 2    Cases      81 9.7&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;From these numbers you compute the p-value, which in this case is just below 0.05.&lt;/p&gt;\r\n&lt;p&gt;Second, the dynamite plot makes it appear as if there is a clear difference between the two groups. &lt;strong&gt;Showing the data&lt;/strong&gt; reveals more information. In our example, showing the data reveals that the lowest blood pressure is actually in the treatment group. It also reveals the presence of one somewhat extreme value of 150. This might represent a data entry mistake. Perhaps systolic pressure was recorded by accident? Note that without that data point, the difference is no longer significant at the 0.05 level.&lt;/p&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-02-21-dynamite-plots-must-die_files/figure-html/show-data-1.png\" width=\"768\" /&gt;&lt;/p&gt;\r\n&lt;p&gt;Note also that, as pointed out by Weissgerber, data that look quite different can result in exactly the same barplot. For instance, the two datasets below would produce the same barplot as the one shown above.&lt;/p&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-02-21-dynamite-plots-must-die_files/figure-html/different-data-same-dynamite-1.png\" width=\"768\" /&gt;&lt;/p&gt;\r\n&lt;/div&gt;\r\n&lt;div id=\"what-should-we-do-instead\" class=\"section level2\"&gt;\r\n&lt;h2&gt;What should we do instead?&lt;/h2&gt;\r\n&lt;p&gt;First, let’s generate the data that we will use in the example R code shown below.&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;library(tidyverse)\r\nset.seed(0)\r\nn &amp;lt;- 10\r\ncases &amp;lt;- rnorm(n, log2(64), 0.25)\r\ncontrols &amp;lt;- rnorm(n, log2(64), 0.25)\r\ncases &amp;lt;- 2^(cases)\r\ncontrols &amp;lt;- 2^(controls)\r\ncases[1:2] &amp;lt;- c(110, 150) #introduce outliers\r\ndat &amp;lt;- data.frame(x = factor(rep(c(&amp;quot;Controls&amp;quot;, &amp;quot;Cases&amp;quot;), each = n), \r\n                             levels = c(&amp;quot;Controls&amp;quot;, &amp;quot;Cases&amp;quot;)),\r\n                             Outcome = c(controls, cases))&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;One option is simply to show the data points, which you can do like this:&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;dat %&amp;gt;% ggplot(aes(x, Outcome)) + \r\n        geom_jitter(width = 0.05)&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-02-21-dynamite-plots-must-die_files/figure-html/just-points-1.png\" width=\"672\" /&gt;&lt;/p&gt;\r\n&lt;p&gt;In this case we see that the data is right skewed so we might want to remake the plot in the log scale&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;dat %&amp;gt;% ggplot(aes(x, Outcome)) + \r\n        geom_jitter(width = 0.05) + \r\n        scale_y_log10()&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-02-21-dynamite-plots-must-die_files/figure-html/just-points-log-1.png\" width=\"672\" /&gt;&lt;/p&gt;\r\n&lt;p&gt;If we want to show summary statistics for the data, we can superimpose a boxplot:&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;dat %&amp;gt;% ggplot(aes(x, Outcome)) + \r\n        geom_boxplot() +\r\n        geom_jitter(width = 0.05) + \r\n        scale_y_log10()&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-02-21-dynamite-plots-must-die_files/figure-html/points-and-boxplot-1.png\" width=\"672\" /&gt;&lt;/p&gt;\r\n&lt;p&gt;Although not the case here, if there are too many points, we can simply show the boxplot.&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;dat %&amp;gt;% ggplot(aes(x, Outcome)) + \r\n        geom_boxplot() +\r\n        scale_y_log10()&lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-02-21-dynamite-plots-must-die_files/figure-html/just-boxplot-1.png\" width=\"672\" /&gt;&lt;/p&gt;\r\n&lt;p&gt;And if we are worried that five summary statistics might be hiding important characteristics of the data, we can use ridge plots.&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;library(ggridges)\r\ndat %&amp;gt;% ggplot(aes(Outcome, x)) + \r\n        scale_x_log10() +\r\n        geom_density_ridges(scale = 0.9) &lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-02-21-dynamite-plots-must-die_files/figure-html/ridge-plots-1.png\" width=\"672\" /&gt;&lt;/p&gt;\r\n&lt;p&gt;If of manageable size, you should show the data points as well:&lt;/p&gt;\r\n&lt;pre class=\"r\"&gt;&lt;code&gt;library(ggridges)\r\ndat %&amp;gt;% ggplot(aes(Outcome, x)) + \r\n        scale_x_log10() +\r\n        geom_density_ridges(scale = 0.9,\r\n                            jittered_points = TRUE, \r\n                            position = position_points_jitter(width = 0.05,\r\n                                                              height = 0),\r\n                            point_shape = &amp;#39;|&amp;#39;, point_size = 3, \r\n                            point_alpha = 1, alpha = 0.7) &lt;/code&gt;&lt;/pre&gt;\r\n&lt;p&gt;&lt;img src=\"https://simplystatistics.org/post/2019-02-21-dynamite-plots-must-die_files/figure-html/ridge-plots-with-data-1.png\" width=\"672\" /&gt;&lt;/p&gt;\r\n&lt;p&gt;For more recommendation and Excel templates please consult &lt;a href=\"https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002128\"&gt;Weissgerber et al.&lt;/a&gt; or &lt;a href=\"https://twitter.com/t_weissgerber/status/1087646461548998657?s=12\"&gt;this thread&lt;/a&gt;.&lt;/p&gt;\r\n&lt;/div&gt;&lt;img src=\"http://feeds.feedburner.com/~r/SimplyStatistics/~4/uhTy3UbMzZA\" height=\"1\" width=\"1\" alt=\"\"/&gt;</description>\r\n    <feedburner:origLink>https://simplystatistics.org/2019/02/21/dynamite-plots-must-die/</feedburner:origLink></item>\r\n    \r\n    <item>\r\n      <title>Interview with Stephanie Hicks</title>\r\n      <link>http://feedproxy.google.com/~r/SimplyStatistics/~3/I2FRoEDCwVc/</link>\r\n      <pubDate>Mon, 18 Feb 2019 00:00:00 +0000</pubDate>\r\n      \r\n      <guid isPermaLink=\"false\">https://simplystatistics.org/2019/02/18/interview-with-stephanie-hicks/</guid>\r\n      <description>&lt;p&gt;&lt;em&gt;Editor&amp;rsquo;s note: For a while we ran an interview series for statisticians and data scientists, but things have gotten a little hectic around here so we&amp;rsquo;ve dropped the ball! But we are re-introducing the series, starting with Stephanie Hicks. If you have recommendations of a (junior) person in academics or industry you would like to see promoted, reach out to Jeff (@jtleek) on Twitter!&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;em&gt;Stephanie Hicks received her PhD in statistics in 2013 at Rice University and has already made major contributions to the analysis of single cell sequencing data and the theory and practice of teaching data science.&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;em&gt;SS: Do you consider yourself a statistician, biostatistician, data scientist or something else?&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;SH: Fantastic question! I’m a statistician by training, and I work in a department of biostatistics, so I would be remiss if I didn’t answer a statistician. However my interests are at the intersection of statistics, data science, genomics and education. Broadly, my research interests are to leverage statistical methods and computational algorithms to effectively derive knowledge from data. I’m also very interested in identifying better ways to teach students how to do that. I work a lot with genomics data, but I also analyze data from many other areas. You might think of this as data science, so I could easily imagine someone classifying myself as an ‘academic data scientist’ if such a thing exists?&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;em&gt;SS: How did you end up at Johns Hopkins (i.e. your history)?&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;SH: I received my B.S. in Mathematics in 2007 rom LSU and my M.A. and Ph.D. in 2013 from the Department of Statistics at Rice University under the direction of Marek Kimmel and Sharon Plon (@splon). I completed my postdoctoral training with Rafael Irizarry (@rafalab) in the Department of Data Sciences at Dana-Farber Cancer Institute and Department of Biostatistics at Harvard T.H. Chan School of Public Health. While I was a postdoc, I had the opportunity to meet many students from the Johns Hopkins Biostatistics Department at the Women in Statistics Conference in Cary, North Carolina in 2014. The following year, I attended the ROpenSci Unconference and teamed up with Roger Peng, Hilary Parker and David Robinson to work on the explainr and catsplainr R packages. Given this department been a pioneer in developing statistical methods for the analysis of genomics data and in data science education, I couldn’t think of a better department to join.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;em&gt;SS: What are the problems that most excite you right now?&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;SH: Within the world of genomics, I’m most excited about open challenges and problems in what’s often referred to as “single-cell genomics”. There is a great potential for using this data that measure features or traits (such as measuring what genes are expressed) in individual cells to help find better diagnoses, prognoses and treatments for many diseases. There are not only open statistical challenges to this data, but also computational challenges. For example, datasets being generated are frequently so large that they cannot even be read into memory. Therefore, one of my projects is to make unsupervised learning methods, such as k-means, scalable to millions of observations (or cells) by combining on-disk data representations (such as HDF5 files) and performing computations on small, random batches of observations so they can be stored in memory and analyzed in a scalable manner.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;em&gt;SS: Are you working on any non-research data science projects you are excited about? What are they?&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;SH: So many! Let’s see, I just focus on one, but happy to talk about more.&lt;/p&gt;\r\n\r\n&lt;p&gt;If you close your eyes and say the word ’teacher’, ’doctor or nurse’, ‘pilot’, ‘chef’: these are all jobs that most people, including children, can easily visualize and conceptualize. I have two young children who easily get inspired from the books that we read. Last year, I went to look for children’s book featuring women in statistics or data science and couldn’t find any. To address this, I’m working with a team of awesome individuals to create a children’s book featuring women in statistics and data science. My goal with this book is to allow for young children to visualize and conceptualize what it means to be a statistician or data scientist. And even more importantly, highlight women to have been trailblazers in these fields, so little girls reading this book may one day be inspired to learn more about statistics and data science and may even choose this career.&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;em&gt;SS: What do you see as the most exciting ways to incorporate data science into academia?&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;Within the world of data science, I’m super interested in finding new ways to teach students how to analyze data in a format that is efficient, effective and scalable. There is a ton of exciting curriculum development happening across the world, and I hope to contribute to that as well. For example, I’m interested in teaching students how to analyze data using the case-study based approach that was pioneered by Deb Nolan and Terry Speed in 1999. My postdoc advisor and I wrote a guide for teaching data science courses based on our teaching experiences. This fall, Roger Peng and I used this approach to teach an Introduction to Data Science course offered in our department at Hopkins (&lt;a href=\"https://jhu-advdatasci.github.io/2018/\"&gt;https://jhu-advdatasci.github.io/2018/&lt;/a&gt;).&lt;/p&gt;\r\n\r\n&lt;p&gt;&lt;em&gt;SS: You started R ladies Baltimore, what is it about that community that inspired you to create the Baltimore branch?&lt;/em&gt;&lt;/p&gt;\r\n\r\n&lt;p&gt;R-Ladies is an amazing global organization that has a goal of achieving proportionate representation by encouraging, inspiring, and empowering people of genders currently underrepresented in the R community. It was started by Gabriela de Queiroz in San Francisco in October 2012 who wanted to give back to her local community after going to several meetups and learning a lot for free, but saw disparity in gender diversity. Since then R-Ladies has grown to 138 chapters in 44 countries and 39000 members. It is also now funded by the R-Consortium.&lt;/p&gt;\r\n\r\n&lt;p&gt;As an active user and developer of R packages, I know first hand how intimidating it can be to get started using R as both a new user and a gender minority in the R community. I started R-Ladies Baltimore (&lt;a href=\"https://rladies-baltimore.github.io\"&gt;https://rladies-baltimore.github.io&lt;/a&gt;) to create a community in Baltimore for underrepresented minorities to grow their knowledge, experience and skills in R, to confidently ask questions, to learn and support each other, and to contribute new R code and packages for their own projects. We have meetups every two months and try to think creatively about how we can inspire our attendees to learn and use R. One of my favorite meetups was in the fall where we met to make holiday card designs in R (&lt;a href=\"https://rladies-baltimore.github.io/post/making-holiday-cards-in-r-2018/\"&gt;https://rladies-baltimore.github.io/post/making-holiday-cards-in-r-2018/&lt;/a&gt;).&lt;/p&gt;&lt;img src=\"http://feeds.feedburner.com/~r/SimplyStatistics/~4/I2FRoEDCwVc\" height=\"1\" width=\"1\" alt=\"\"/&gt;</description>\r\n    <feedburner:origLink>https://simplystatistics.org/2019/02/18/interview-with-stephanie-hicks/</feedburner:origLink></item>\r\n    \r\n    <item>\r\n      <title>The Tentpoles of Data Science</title>\r\n      <link>http://feedproxy.google.com/~r/SimplyStatistics/~3/2i0Ll4j5gnQ/</link>\r\n      <pubDate>Fri, 18 Jan 2019 00:00:00 +0000</pubDate>\r\n      \r\n      <guid isPermaLink=\"false\">https://simplystatistics.org/2019/01/18/the-tentpoles-of-data-science/</guid>\r\n      <description>&lt;p&gt;What makes for a good data scientist? This is a question I asked &lt;a href=\"https://simplystatistics.org/2012/05/07/how-do-you-know-if-someone-is-great-at-data-analysis/\"&gt;a long time ago&lt;/a&gt; and am still trying to figure out the answer. Seven years ago, I wrote:&lt;/p&gt;\r\n\r\n&lt;blockquote&gt;\r\n&lt;p&gt;I was thinking about the people who I think are really good at data analysis and it occurred to me that they were all people I knew. So I started thinking about people that I don’t know (and there are many) but are equally good at data analysis. This turned out to be much harder than I thought. And I’m sure it’s not because they don’t exist, it’s just because I think good data analysis chops are hard to evaluate from afar using the standard methods by which we evaluate people.&lt;/p&gt;\r\n&lt;/blockquote&gt;\r\n\r\n&lt;p&gt;Now that time has passed and I’ve had an opportunity to see what’s going on in the world of data science, what I think about good data scientists, and what seems to make for good data analysis, I have a few more ideas on what makes for a good data scientist. In particular, I think there are broadly five “tentpoles” for a good data scientist. Each tentpole represents a major area of activity that will to some extent be applied in any given data analysis.&lt;/p&gt;\r\n\r\n&lt;p&gt;When I ask myself the question “What is data science?” I tend to think of the following five components. Data science is&lt;/p&gt;\r\n\r\n&lt;ul&gt;\r\n&lt;li&gt;the application of &lt;strong&gt;design thinking&lt;/strong&gt; to data problems;&lt;/li&gt;\r\n&lt;li&gt;the creation and management of &lt;strong&gt;workflows&lt;/strong&gt; for transforming and processing data;&lt;/li&gt;\r\n&lt;li&gt;the negotiation of &lt;strong&gt;human relationships&lt;/strong&gt; to identify context, allocate resources, and characterize audiences for data analysis products;&lt;/li&gt;\r\n&lt;li&gt;the application of &lt;strong&gt;statistical methods&lt;/strong&gt; to quantify evidence; and&lt;/li&gt;\r\n&lt;li&gt;the transformation of data analytic information into coherent &lt;strong&gt;narratives and stories&lt;/strong&gt;&lt;/li&gt;\r\n&lt;/ul&gt;\r\n\r\n&lt;p&gt;My contention is that if you are a good data scientist, then you are good at all five of the tentpoles of data science. Conversely, if you are good at all five tentpoles, then you’ll likely be a good data scientist.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"design-thinking\"&gt;Design Thinking&lt;/h2&gt;\r\n\r\n&lt;p&gt;Listeners of &lt;a href=\"http://www.nssdeviations.com\"&gt;my podcast&lt;/a&gt; know that Hilary Parker and I are fans of design thinking. Having recently spent eight episodes discussing Nigel Cross’s book &lt;a href=\"https://www.amazon.com/Design-Thinking-Understanding-Designers-Think/dp/1847886361\"&gt;&lt;em&gt;Design Thinking&lt;/em&gt;&lt;/a&gt;, it’s clear I think this is a major component of good data analysis.&lt;/p&gt;\r\n\r\n&lt;p&gt;The main focus here is developing a proper framing of a problem and homing in on the most appropriate question to ask. Many good data scientists are distinguished by their ability to think of a problem in a new way. Figuring out the best way to ask a question requires knowledge and consideration of the audience and what it is they need. I think it’s also important to frame the problem in a way that is personally interesting (if possible) so that you, as the analyst, are encouraged to look at the data analysis as a systems problem. This requires digging into all the details and looking into areas that others who are less interested might overlook. Finally, alternating between &lt;a href=\"https://simplystatistics.org/2018/09/14/divergent-and-convergent-phases-of-data-analysis/\"&gt;divergent and convergent thinking&lt;/a&gt; is useful for exploring the problem space via potential solutions (rough sketches), but also synthesizing many ideas and bringing oneself to focus on a specific question.&lt;/p&gt;\r\n\r\n&lt;p&gt;Another important area that design thinking touches is the solicitation of &lt;em&gt;domain knowledge&lt;/em&gt;. &lt;a href=\"https://simplystatistics.org/2018/11/01/the-role-of-academia-in-data-science-education/\"&gt;Many would argue&lt;/a&gt; that having domain knowledge is a key part of developing a good data science solution. But I don’t think being a good data scientist is &lt;em&gt;about&lt;/em&gt; having specific knowledge of biology, web site traffic, environmental health, or clothing styles. Rather, if you want to have an impact in any of those areas, it’s important to be able to &lt;em&gt;solicit&lt;/em&gt; the relevant information&amp;mdash;including domain knowledge&amp;mdash;for solving the problem at hand. I don’t have a PhD in environmental health sciences, and my knowledge of that area is not at the level of someone who does. But I believe that over my career, I have solicited the relevant information from experts and have learned the key facts that are needed to conduct data science research in this area.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"workflows\"&gt;Workflows&lt;/h2&gt;\r\n\r\n&lt;p&gt;Over the past 15 years or so, there has been a growing discussion of the importance of good workflows in the data analysis community. At this point, I’d say a critical job of a data scientist is to develop and manage the workflows for a given data problem. Most likely, it is the data scientist who will be in a position to observe how the data flows through a team or across different pieces of software, and so the data scientist will know how best to manage these transitions. If a data science problem is a &lt;em&gt;systems problem&lt;/em&gt;, then the workflow indicates how different pieces of the system talk to each other. While the tools of data analytic workflow management are constantly changing, the importance of the idea persists and staying up-to-date with the best tools is a key part of the job.&lt;/p&gt;\r\n\r\n&lt;p&gt;In the scientific arena the end goal of good workflow management is often reproducibility of the scientific analysis. But good workflow can also be critical for collaboration, team management, and producing &lt;em&gt;good&lt;/em&gt; science (as opposed to merely reproducible science). Having a good workflow can also facilitate sharing of data or results, whether it’s with another team at the company or with the public more generally, as in the case of scientific results. Finally, being able to understand and communicate how a given result has been generated through the workflow can be of great importance when problems occur and need to be debugged.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"human-relationships\"&gt;Human Relationships&lt;/h2&gt;\r\n\r\n&lt;p&gt;In previous posts I’ve discussed the importance of &lt;a href=\"https://simplystatistics.org/2018/05/24/context-compatibility-in-data-analysis/\"&gt;context&lt;/a&gt;, &lt;a href=\"https://simplystatistics.org/2018/06/18/the-role-of-resources-in-data-analysis/\"&gt;resources&lt;/a&gt;, and &lt;a href=\"https://simplystatistics.org/2018/04/17/what-is-a-successful-data-analysis/\"&gt;audience&lt;/a&gt; for producing a successful data analysis. Being able to grasp all of these things typically involves having &lt;a href=\"https://simplystatistics.org/2018/04/30/relationships-in-data-analysis/\"&gt;good relationships&lt;/a&gt; with other people, either within a data science team or outside it. In my experience, poor relationships can often lead to poor work.&lt;/p&gt;\r\n\r\n&lt;p&gt;It’s a rare situation where a data scientist works completely alone, accountable to no one, only presenting to themselves. Usually, resources must be obtained to do the analysis in the first place and the audience (i.e. users, customers, viewers, scientists) must be characterized to understand how a problem should be framed or a question should be asked. All of this will require having relationships with people who can provide the resources or the information that a data scientist needs.&lt;/p&gt;\r\n\r\n&lt;p&gt;Failures in data analysis can often be traced back to a breakdown in human relationships and in communication between team members. As the &lt;a href=\"https://simplystatistics.org/2018/04/23/what-can-we-learn-from-data-analysis-failures/\"&gt;Duke Saga&lt;/a&gt; showed us, dramatic failures do not occur because someone didn’t know what a &lt;em&gt;p&lt;/em&gt;-value was or how to fit a linear regression. In that particular case, knowledgeable people reviewed the analysis, identified exactly all the serious the problems, raised the issues with the right people, and&amp;hellip;were ignored. There is no statistical method that I know of that can prevent disaster from occurring under this circumstance. Unfortunately, for outside observers, it’s usually impossible to see this process happening, and so we tend to attribute failures to the parts that we &lt;em&gt;can&lt;/em&gt; see.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"statistical-methods\"&gt;Statistical Methods&lt;/h2&gt;\r\n\r\n&lt;p&gt;Applying statistical methods is obviously essential to the job of a data scientist. In particular, knowing what methods are most appropriate for different situations and different kinds of data, and which methods are best-suited to answer different kinds of questions. Proper application of statistical methods is clearly important to doing &lt;em&gt;good&lt;/em&gt; data analysis, but it’s also important for data scientists to know what methods can be reasonably applied given the constraint on resources. If an analysis must be done by tomorrow, one cannot apply a method that requires two days to complete. However, if the method that requires two days is the &lt;em&gt;only&lt;/em&gt; appropriate method, then additional time or resources must be negotiated (thus necessitating good relationships with others).&lt;/p&gt;\r\n\r\n&lt;p&gt;I don’t think much more needs to be said here as I think most assume that knowledge of statistical methods is critical to being a good data scientist. That said, one important aspect that falls into this category is the &lt;em&gt;implementation&lt;/em&gt; of statistical methods, which can be more or less complex depending on the size of the data. Sophisticated computational algorithms and methods may need to be applied or developed from scratch if a problem is too big to work on off-the-shelf software. In such cases, a good data scientist will need to know how to implement these methods so that the problem can be solved. While it is sometimes necessary to collaborate with an expert in this area who can implement a complex algorithm, this creates a new layer of communication and another relationship that must be properly managed.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"narratives-and-stories\"&gt;Narratives and Stories&lt;/h2&gt;\r\n\r\n&lt;p&gt;Even the simplest of analyses can produce an overwhelming amount of results and being able to distill that information into a coherent narrative or story is critical to the success of an analysis. If a great analysis is done, but no one can understand it, did it really happen? Narratives and stories serve as &lt;em&gt;dimension reduction for results&lt;/em&gt; and allow an audience to navigate a specified path through the sea of information.&lt;/p&gt;\r\n\r\n&lt;p&gt;Data scientists have to prioritize what is important and what is not and present things that are relevant to the audience. Part of building a good narrative is choosing the right presentation materials to tell the story, whether they be plots, tables, charts, or text. There is rarely an optimal choice that serves all situations because what works best will be highly audience- and context-dependent. Data scientists need to be able to “read the room”, so to speak, and make the appropriate choices. Many times, when I’ve seen critiques of data analyses, it’s not the analysis that is being criticized but rather the choice of narrative. If the data scientist chooses to emphasize one aspect but the audience thinks another aspect is more important, the analysis will seem &amp;ldquo;wrong&amp;rdquo; even though the application of the methods to the data is correct.&lt;/p&gt;\r\n\r\n&lt;p&gt;A hallmark of good communication about a data analysis is providing a way for the audience to &lt;a href=\"https://simplystatistics.org/2017/11/16/reasoning-about-data/\"&gt;reason about&lt;/a&gt; &lt;a href=\"https://simplystatistics.org/2017/11/20/follow-up-on-reasoning-about-data/\"&gt;the data&lt;/a&gt; and to understand how the data are tied to the result. This is a &lt;em&gt;data analysis&lt;/em&gt; after all, and we should be able to see for ourselves how the data inform the conclusion. As an audience member in this situation, I’m not as interested in just trusting the presenter and their conclusions.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"describing-a-good-data-scientist\"&gt;Describing a Good Data Scientist&lt;/h2&gt;\r\n\r\n&lt;p&gt;When thinking of some of the best data scientists I’ve known over the years, I think they are all good at the five tentpoles I’ve described above. However, what about the converse? If you met someone who demonstrated that they were good at these five tentpoles, would you think they were a good data scientist? I think the answer is yes, and to get a sense of this, one need look no further than a typical job advertisement for a data science position.&lt;/p&gt;\r\n\r\n&lt;p&gt;I recently saw this &lt;a href=\"https://fertiglab.com/opportunities\"&gt;job ad&lt;/a&gt; from my Johns Hopkins colleague Elana Fertig. She works in the area of computational biology and her work involves analyzing large quantities of data to draw connections between people’s genes and cancer (if I may make a gross oversimplification). She is looking for a postdoctoral fellow to join her lab and the requirements listed for the position are typical of many ads of this type:&lt;/p&gt;\r\n\r\n&lt;ul&gt;\r\n&lt;li&gt;PhD in computational biology, biostatistics, biomedical engineering, applied mathematics, or a related field.&lt;/li&gt;\r\n&lt;li&gt;Proficiency in programming with R/Bioconductor and/or python for genomics analysis.&lt;/li&gt;\r\n&lt;li&gt;Experience with high-performance computing clusters and LINUX scripting.&lt;/li&gt;\r\n&lt;li&gt;Techniques for reproducible research and version control, including but not limited to experience generating knitr reports, GitHub repositories, and R package development.&lt;/li&gt;\r\n&lt;li&gt;Problem-solving skills and independence.&lt;/li&gt;\r\n&lt;li&gt;The ability to work as part of a multidisciplinary team.&lt;/li&gt;\r\n&lt;li&gt;Excellent written and verbal communication skills.&lt;/li&gt;\r\n&lt;/ul&gt;\r\n\r\n&lt;p&gt;This is a job where complex &lt;strong&gt;statistical methods&lt;/strong&gt; will be applied to large biological datasets. As a result, knowledge of the methods or the biology will be useful, and knowing how to implement these methods on a large scale (i.e. via cluster computing) will be important. Knowing techniques for reproducible research requires knowledge of the proper &lt;strong&gt;workflows&lt;/strong&gt; and how to manage them throughout an analysis. Problem-solving skills is practically synonymous with &lt;strong&gt;design thinking&lt;/strong&gt;; working as part of a multidisciplinary team requires negotiating &lt;strong&gt;human relationships&lt;/strong&gt;; and developing &lt;strong&gt;narratives and stories&lt;/strong&gt; requires excellent written and verbal communication skills.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"summary\"&gt;Summary&lt;/h2&gt;\r\n\r\n&lt;p&gt;A good data scientist can be hard to find, and part of the reason is because being a good data scientist requires mastering skills in a wide range of areas. However, these five tentpoles are not haphazardly chosen; rather they reflect the interwoven set of skills that are needed to solve complex data problems. Focusing on being good at these five tentpoles means sacrificing time spent studying other things. To the extent that we can coalesce around the idea of convincing people to do exactly that, data science will become a distinct field with its own identity and vision.&lt;/p&gt;&lt;img src=\"http://feeds.feedburner.com/~r/SimplyStatistics/~4/2i0Ll4j5gnQ\" height=\"1\" width=\"1\" alt=\"\"/&gt;</description>\r\n    <feedburner:origLink>https://simplystatistics.org/2019/01/18/the-tentpoles-of-data-science/</feedburner:origLink></item>\r\n    \r\n    <item>\r\n      <title>How Data Scientists Think - A Mini Case Study</title>\r\n      <link>http://feedproxy.google.com/~r/SimplyStatistics/~3/4EbVtQc0tsA/</link>\r\n      <pubDate>Wed, 09 Jan 2019 00:00:00 +0000</pubDate>\r\n      \r\n      <guid isPermaLink=\"false\">https://simplystatistics.org/2019/01/09/how-data-scientists-think-a-mini-case-study/</guid>\r\n      <description>&lt;p&gt;In episode 71 of &lt;a href=\"http://nssdeviations.com/\"&gt;Not So Standard Deviations&lt;/a&gt;, Hilary Parker and I inaugurated our first “Data Science Design Challenge” segment where we discussed how we would solve a given problem using data science. The idea with calling it a “design challenge” was to contrast it with common “hackathon” type models where you are presented with an already-collected dataset and then challenged to find something interesting in the data. Here, we wanted to start with a problem and then talk about how data might be collected and analyzed to address the problem. While both approaches might result in the same end-product, they address the various problems you encounter in a data analysis in a different order.&lt;/p&gt;\r\n\r\n&lt;p&gt;In this post, I want to break down our discussion of the challenge and highlight some of the issues that were discussed in framing the problem and in designing the data collection and analysis. I’ll end with some thoughts about generalizing this approach to other problems.&lt;/p&gt;\r\n\r\n&lt;p&gt;You can &lt;a href=\"https://www.dropbox.com/s/yajgbr25dbh20i0/NSSD%20Episode%2071%20Design%20Challenge.mp3?dl=0\"&gt;download an MP3 of this segment of the episode&lt;/a&gt; (it is about 45 minutes long) or you can &lt;a href=\"https://drive.google.com/open?id=11dEhj-eoh8w13dS-mWvDMv7NKWXZcXMr\"&gt;read the transcript of the segment&lt;/a&gt;. If you’d prefer to stream the segment you can &lt;a href=\"https://overcast.fm/+FMBuKdMEI/00:30\"&gt;start listening here&lt;/a&gt;.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"the-brief\"&gt;The Brief&lt;/h2&gt;\r\n\r\n&lt;p&gt;The general goal was to learn more about the time it takes for each of us to commute to work. Hilary lives in San Francisco and I live in Baltimore, so the characteristics of our commutes are very different. She walks and takes public transit; I drive most days. We also wanted to discuss how we might collect data on our commute times in a systematic, but not intrusive, manner. When we originally discussed having this segment, this vague description was about the level of specification that we started with, so an initial major task was to&lt;/p&gt;\r\n\r\n&lt;ol&gt;\r\n&lt;li&gt;Develop a better understanding of what question each of us was trying to answer;&lt;/li&gt;\r\n&lt;li&gt;Frame the problem in a manner that could be translated into a data collection task; and&lt;/li&gt;\r\n&lt;li&gt;Sketch out a feasible statistical analysis.&lt;/li&gt;\r\n&lt;/ol&gt;\r\n\r\n&lt;h2 id=\"framing-the-problem\"&gt;Framing the Problem&lt;/h2&gt;\r\n\r\n&lt;p&gt;Hilary and I go through a few rounds of discussion on the topic of how to think about this problem and the questions that we’re trying to answer. Early in the discussion Hilary mentions that this problem was “pressing on my mind” and that she took a particular interest in seeing the data and acting on it. Her intense interest in the problem potentially drove part of her creativity in developing solutions.&lt;/p&gt;\r\n\r\n&lt;p&gt;Hilary initially mentions that the goal is to understand the variation in commute times (i.e. estimate the variance), but then quickly shifts to the problem of estimating average commute times for the two different commute methods that she uses.&lt;/p&gt;\r\n\r\n&lt;blockquote&gt;\r\n&lt;p&gt;HILARY: you&amp;hellip;maybe only have one commute method and you want to understand the variance of that. So&amp;hellip;what range of times does it usually take for me to get to work, or&amp;hellip;I have two alternatives for my commute methods. So it might be like how long does it take me in this way versus that way? And for me the motivation is that I want to make sure I know when to leave so that I make it to meetings on time&amp;hellip;.&lt;/p&gt;\r\n&lt;/blockquote&gt;\r\n\r\n&lt;p&gt;In her mind, the question being answered by this data collection is, “When should I leave home to get to meetings on time?” At this point she mentions two possible ways to think about addressing this question.&lt;/p&gt;\r\n\r\n&lt;ol&gt;\r\n&lt;li&gt;Estimate the variability of commute times and leave the house accordingly; or&lt;/li&gt;\r\n&lt;li&gt;Compare two different commute methods and then &lt;em&gt;choose&lt;/em&gt; a method on any given day.&lt;/li&gt;\r\n&lt;/ol&gt;\r\n\r\n&lt;p&gt;Right off the bat, Hilary notes that she doesn’t actually do this commute as often as she’d thought. Between working from home, taking care of chores in the morning, making stops on the way, and walking/talking with friends, a lot of variation can be introduced in to the data.&lt;/p&gt;\r\n\r\n&lt;p&gt;I mention that “going to work” and “going home”, while both can be thought of as commutes, are not the same thing and that we might be interested in one more than the other. Hilary agrees that they are different problems but they are both potentially of interest.&lt;/p&gt;\r\n\r\n&lt;h3 id=\"question-intervention-duality\"&gt;Question/Intervention Duality&lt;/h3&gt;\r\n\r\n&lt;p&gt;At this point I mention that my commuting is also affected by various other factors and that on different days of the week, I have a different commute pattern. On days where I drop my son off at school, I have less control over when I leave compared to days when I drive straight to work. Here, we realize a fundamental issue, which is that different days of the week indicate somewhat different interventions to take:&lt;/p&gt;\r\n\r\n&lt;ul&gt;\r\n&lt;li&gt;On days where I drive straight to work, the question is “When should I leave to arrive on time for the first meeting?”&lt;/li&gt;\r\n&lt;li&gt;On days where I drop my son off at school, the question is “When is the earliest time that I can schedule the first meeting of the day?”&lt;/li&gt;\r\n&lt;/ul&gt;\r\n\r\n&lt;p&gt;In the former situation I have control over, and could potentially intervene on, when I leave the house, whereas in the latter situation I have control over when I schedule the first meeting. While these are distinct questions with different implications, at this point they may both require collecting travel time data in the same manner.&lt;/p&gt;\r\n\r\n&lt;p&gt;Earlier in this section I mention that on days when I drop my son off at school it can take 45 minutes to get to work. Hilary challenges this observation and mentions that “Baltimore is not that big”. She makes use of her knowledge of Baltimore geography to suggest that this is unexpected. However, I mention that the need to use local roads exclusively for this particular commute route makes it indeed take longer than one might expect.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"designing-the-data-collection-system\"&gt;Designing the Data Collection System&lt;/h2&gt;\r\n\r\n&lt;p&gt;In discussing the design of her data collection system, Hilary first mentions that a podcast listener had emailed in and mentioned his use of Google Maps to predict travel times based on phone location data. While this seemed like a reasonable idea, it ultimately was not the direction she took.&lt;/p&gt;\r\n\r\n&lt;blockquote&gt;\r\n&lt;p&gt;HILARY: At first I was thinking about that because I have location history on and I look at it a lot, but there’s also a fair degree of uncertainty there. Like sometimes it just puts me in these really weird spots or&amp;hellip;I lose GPS signal when I go underground and also I do not know how to get data in an API sense from that. So I knew it would be manual data. In order to analyze the data, I would have to go back and be like let me go and collect the data from this measurement device. So I was trying to figure out what I could use instead.&lt;/p&gt;\r\n&lt;/blockquote&gt;\r\n\r\n&lt;p&gt;Then she describes how she can use Wi-Fi connections (and dis-connections) to serve as surrogates for leaving and arriving.&lt;/p&gt;\r\n\r\n&lt;blockquote&gt;\r\n&lt;p&gt;And at some point I realized that two things that reliably happen every time I do a commute is that my phone disconnects from my home Wi-Fi. And then it reconnects to my work Wi-Fi. And so I spent some time trying to figure out if I could log that information, like if there’s an app that logged that, and there is not. But, there is a program called If This, Then That, or an app. And so with that you can say “when my phone disconnects from Wi-Fi do something”, and you can set it to a specific Wi-Fi. So that was exciting.&lt;/p&gt;\r\n&lt;/blockquote&gt;\r\n\r\n&lt;p&gt;Other problems that needed solving were:&lt;/p&gt;\r\n\r\n&lt;ul&gt;\r\n&lt;li&gt;&lt;em&gt;Where to store the data&lt;/em&gt;. Hilary mentions that a colleague was using &lt;a href=\"https://airtable.com\"&gt;Airtable&lt;/a&gt; (a kind of cloud-based spreadsheet/database) and decided to give it a try.&lt;/li&gt;\r\n&lt;li&gt;&lt;em&gt;Indicating commute method&lt;/em&gt;. Hilary created a system where she could send a text message containing a keyword about her commute method to a service that would then log the information to the table collecting the travel time data.&lt;/li&gt;\r\n&lt;li&gt;&lt;em&gt;Multiple Wi-Fi connects&lt;/em&gt;. Because her phone was constantly connecting and disconnecting from Wi-Fi at work, she had to define the “first connection to Wi-Fi at work” as meaning that she had arrived at work.&lt;/li&gt;\r\n&lt;li&gt;&lt;em&gt;Sensing a Wi-Fi disconnect&lt;/em&gt;. Hilary’s phone had to be “awake” in order to sense a Wi-Fi disconnect, which was generally the case, but not always. There was no way to force her phone to always be awake, but she knew that the system would send her a push notification when it had been triggered. Therefore, she would at least know that if she didn’t receive a push notification, then something had gone wrong.&lt;/li&gt;\r\n&lt;/ul&gt;\r\n\r\n&lt;p&gt;Hilary mentions that much of the up front effort is important in order to avoid messy data manipulations later on.&lt;/p&gt;\r\n\r\n&lt;blockquote&gt;\r\n&lt;p&gt;HILARY: I think I’ll end up&amp;mdash;I did not do the analysis yet but I’ll end up having to scrub the data. So I was trying to avoid manual data scrubbing, but I think I’m going to have to do it anyway.&lt;/p&gt;\r\n&lt;/blockquote&gt;\r\n\r\n&lt;p&gt;Ultimately, it is impossible to avoid all data manipulation problems.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"specifying-the-data\"&gt;Specifying the Data&lt;/h2&gt;\r\n\r\n&lt;p&gt;What exactly are the data that we will be collecting? What are the covariates that we need to help us understand and model the commute times? Obvious candidates are&lt;/p&gt;\r\n\r\n&lt;ul&gt;\r\n&lt;li&gt;The start time for the commute (date/time format, but see below)&lt;/li&gt;\r\n&lt;li&gt;The end time for the commute (date/time)&lt;/li&gt;\r\n&lt;li&gt;Indicator of whether we are going to work or going home (categorical)&lt;/li&gt;\r\n&lt;li&gt;Commute method (categorical)&lt;/li&gt;\r\n&lt;/ul&gt;\r\n\r\n&lt;p&gt;Hilary notes that from the start/end times we can get things like day of the week and time of day (e.g. via the &lt;a href=\"https://cran.r-project.org/web/packages/lubridate/index.html\"&gt;lubridate&lt;/a&gt; package). She also notes that her system doesn’t exactly produce date/time data, but rather a text sentence that includes the date/time embedded within. Thankfully, that can be systematically dealt with using simple string processing functions.&lt;/p&gt;\r\n\r\n&lt;p&gt;A question arises about whether a separate variable should be created to capture “special circumstances” while commuting. In the data analysis, we may want to exclude days where we know something special happened to make the commute much longer than we might have expected (e.g. we happened to see a friend along the way or we decided to stop at Walgreens). The question here is&lt;/p&gt;\r\n\r\n&lt;blockquote&gt;\r\n&lt;p&gt;Are these special circumstances part of the natural variation in the commute time that we want to capture, or are they “one-time events” that are in some sense predictable?&lt;/p&gt;\r\n&lt;/blockquote&gt;\r\n\r\n&lt;p&gt;A more statistical way of asking the question might be, do these special circumstances represent &lt;a href=\"https://simplystatistics.org/2018/07/23/partitioning-the-variation-in-data/\"&gt;&lt;em&gt;fixed or random variation&lt;/em&gt;&lt;/a&gt;? If they are random and essentially uncontrollable events, then we would want to include that in the random portion of any model. However, if they are predictable (and perhaps controllable) events, then we might want to think of them as another covariate.&lt;/p&gt;\r\n\r\n&lt;p&gt;While Hilary believes that she ultimately &lt;em&gt;does&lt;/em&gt; have control over whether these time-consuming detours occur or not, she decides to model them as essentially random variation and that these events should be lumped in with the natural variation in the data.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"specifying-the-treatment\"&gt;Specifying the Treatment&lt;/h2&gt;\r\n\r\n&lt;p&gt;At this point in the discussion there is a question regarding what &lt;em&gt;effect&lt;/em&gt; we are trying to learn about. The issue is that sometimes changes to a commute have to be made on the fly to respond to unexpected events. For example, if the public transportation system breaks down, you might have to go on foot.&lt;/p&gt;\r\n\r\n&lt;blockquote&gt;\r\n&lt;p&gt;ROGER: Well it becomes like a compliance situation, right? Like you can say, do you want to know how long does it take when you take MUNI or how long does it take when you intend to take MUNI?&lt;/p&gt;\r\n&lt;/blockquote&gt;\r\n\r\n&lt;p&gt;In this section I mention that it’s like a “compliance problem”. In clinical trials, for example when testing a new drug versus a placebo, it is possible to have a situation where people in the treatment group of the study are given the new drug but do not actually take it. Maybe the drug has side effects or is inconvenient to take. Whatever the reason, they are not complying with the protocol of the study, which states that everyone in the treatment group takes the new drug. The question then is whether you want to use the data from the clinical trial to understand the actual effect of the drug or if you want to understand the effect of telling someone to take the drug. The latter effect is often referred to as the &lt;em&gt;intention to treat&lt;/em&gt; effect while the former is sometimes called the &lt;em&gt;complier average effect&lt;/em&gt;. Both are valid effects to estimate and have different implications in terms of next steps.&lt;/p&gt;\r\n\r\n&lt;p&gt;In the context of Hilary’s problem, we want to estimate the average commute time for each commute method. However, what happens if Muni experiences some failure that requires altering the commute method? The potential “compliance issue” here is whether Muni works properly or not. If it does not, then Hilary may take some alternate route to work, even though she &lt;em&gt;intended&lt;/em&gt; to take Muni. Whether Muni works properly or not is a kind of “post-treatment variable” because it’s not under the direct control of Hilary and its outcome is only known &lt;em&gt;after&lt;/em&gt; she decides on which commute method she is going to take (i.e. the “treatment”). Now a choice must be made: Do we estimate the average commute time when taking Muni or the average commute time when she &lt;em&gt;intends&lt;/em&gt; to take Muni, even if she has to divert to an alternate route?&lt;/p&gt;\r\n\r\n&lt;p&gt;Hilary and I both seem to agree that the intention to treat effect is the one we want to estimate in the commute time problem. The reason is that the estimation of this effect has direct implications for the thing that we have control over: choosing which commute method to use. While it might be interesting from a scientific perspective to know the average commute time when taking Muni, regardless of whether we intended to take it or now, we have no control over the operation of Muni on any given day.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"starting-from-the-end\"&gt;Starting from the End&lt;/h2&gt;\r\n\r\n&lt;p&gt;I ask Hilary, suppose we have the data, what might we do with it? Specifically, suppose that we estimate for a given commute method that the average time is 20 minutes and the standard deviation is 5 minutes. What “intervention” would that lead us to take? What might we do differently from before when we had no systematically collected data?&lt;/p&gt;\r\n\r\n&lt;p&gt;Hilary answers by saying that we can designate a time to leave work based on the mean and standard deviation. For example, if we have to be at work at 9am we might leave at 8:35am (mean + 1 standard deviation) to ensure we’ll be arrive at 9am most of the time. In her answer, Hilary raises an important, but perhaps uncomfortable, consideration:&lt;/p&gt;\r\n\r\n&lt;blockquote&gt;\r\n&lt;p&gt;HILARY: I think in a completely crass world for example, I would choose different cutoffs based on the importance of a meeting. And I think people do this. So if you have a super important meeting, this is like a career-making meeting, you leave like an hour early&amp;hellip;. And so, there you’re like “I am going to do three standard deviations above the mean” so&amp;hellip;it’s very unlikely that I’ll show up outside of the time I predicted. But then if it’s like a touch-base with someone where you have a really strong relationship and they know that you value their time, then maybe you only do like one standard deviation.&lt;/p&gt;\r\n&lt;/blockquote&gt;\r\n\r\n&lt;p&gt;Later I mention one implication for statistical modeling:&lt;/p&gt;\r\n\r\n&lt;blockquote&gt;\r\n&lt;p&gt;Roger: Well and I feel like&amp;hellip;the discussion of the distribution is interesting because it might come down to like, what do you think the tail of the distribution looks like? So what’s the worst case? Because if you want to minimize the worst case scenario, then you really, really need to know like what that tail looks like.&lt;/p&gt;\r\n&lt;/blockquote&gt;\r\n\r\n&lt;p&gt;Thinking about what the data will ultimately be used for raises two important statistical considerations:&lt;/p&gt;\r\n\r\n&lt;ul&gt;\r\n&lt;li&gt;We should think about the extremes/tails of the distribution and develop cutoffs that determine what time we should leave for work.&lt;/li&gt;\r\n&lt;li&gt;The cutoffs at the tail of the distribution might be dependent on the “importance” of the first meeting of the day, suggesting the existence of a cost function that quantifies the importance of arriving on time.&lt;/li&gt;\r\n&lt;/ul&gt;\r\n\r\n&lt;p&gt;Hilary raises a hard truth, which is that not everyone gets the same consideration when it comes to showing up on time. For an important meeting, we might allow for “three standard deviations” more than the mean travel time to ensure some margin of safety for arriving on time. However, for a more routine meeting, we might just provide for one standard deviation of travel time and let natural variation take its course for better or for worse.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"statistical-modeling-considerations\"&gt;Statistical Modeling Considerations&lt;/h2&gt;\r\n\r\n&lt;p&gt;I mention that thinking about our imaginary data in terms of “mean” and “standard deviation” implies that the data have a distribution that is akin to a Normal distribution. However, given that the data will consist of travel times, which are always positive, a Normal distribution (which allows positive and negative numbers) may not be the most appropriate. Alternatives are the Gamma or the log-Normal distribution which are strictly positive. I mention that the log-Normal distribution allows for some fairly extreme events, to which Hilary responds that such behavior may in fact be appropriate for these data due to the near-catastrophic nature of Muni failures (San Francisco residents can feel free to chime in here).&lt;/p&gt;\r\n\r\n&lt;p&gt;From the previous discussion on what we might do with this data, it’s clear that the right tail of the distribution will be important in this analysis. We want to know what the “worst case scenario” might be in terms of total commute time. However, by its very nature, extreme data are rare, and so there will be very few data points that can be used to inform the shape of the distribution in this area (as opposed to the middle of the distribution where we will have many observations). Therefore, it’s likely that our choice of model (Gamma, log-Normal, etc.) will have a big influence on the predictions that we make about commute times in the future.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"study-design-considerations\"&gt;Study Design Considerations&lt;/h2&gt;\r\n\r\n&lt;p&gt;Towards the end I ask Hilary how much data is needed for this project? However, before asking I should have discussed the nature of the study itself:&lt;/p&gt;\r\n\r\n&lt;ul&gt;\r\n&lt;li&gt;Is it a fixed study designed to answer a specific question (i.e. what is the mean commute time?) within some bound of uncertainty? Or&lt;/li&gt;\r\n&lt;li&gt;Is it an ongoing study where data will be continuously collected and actions will be continuously adapted as new data are collected&lt;/li&gt;\r\n&lt;/ul&gt;\r\n\r\n&lt;p&gt;Hilary suggests that it is the latter and that she will simply collect data and make decisions as she goes. However, it’s clear that the time frame is not “forever” because the method of data collection is not zero cost. Therefore, at some point the costs of collecting data will likely be too great in light of any perceived benefit.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"discussion\"&gt;Discussion&lt;/h2&gt;\r\n\r\n&lt;p&gt;What have we learned from all of this? Most likely, the problem of estimating commute times is &lt;em&gt;not&lt;/em&gt; relevant to everybody. But I think there are aspects of the process described above that illustrate how the data analytic process works before data collection begins (yes, data analysis includes parts where there are no data). These aspects can be lifted from this particular example and generalized to other data analyses. In this section I will discuss some of these aspects and describe why they may be relevant to other analyses.&lt;/p&gt;\r\n\r\n&lt;h3 id=\"personal-interest-and-knowledge\"&gt;Personal Interest and Knowledge&lt;/h3&gt;\r\n\r\n&lt;p&gt;Hilary makes clear that she is very personally interested in this problem and in developing a solution. She wants to apply any knowledge learned from the data to her everyday life. In addition, she used her knowledge of Baltimore geography (from having lived there previously) to challenge my “mental data analysis”.&lt;/p&gt;\r\n\r\n&lt;p&gt;Taking a strong personal interest in a problem is not always an option, but it can be very useful. Part of the reason is that it can allow you to see the “whole problem” and all of its facets without much additional effort. An uninterested person can certainly learn all the facets of a problem, but it will seem more laborious. If you are genuinely interested in the subject of a problem, then you will be highly motivated to learn everything about that problem, which will likely benefit you in the data analysis. To the extent that data analysis is a &lt;em&gt;systems problem&lt;/em&gt; with many interacting parts, it helps to learn as much as possible about the system. Being &lt;em&gt;interested&lt;/em&gt; in knowing how the system works is a key advantage you can bring to any problem.&lt;/p&gt;\r\n\r\n&lt;p&gt;In my own teaching, I have found that students who are keenly interested in the problems they’re working on often do well in the data analysis. Partly, this is because they are more willing to dig into the nitty gritty of the data and modeling and to uncover small details that others may not find. Also, students with a strong interest often have strong expectations about what the data should show. If the data turn out to be different from what they are expecting, that surprise is often an important experience, sometimes even delightful. Students with a more distant relationship with the topic or the data can never be surprised because they have little in the way of expectations.&lt;/p&gt;\r\n\r\n&lt;h3 id=\"problem-framing\"&gt;Problem Framing&lt;/h3&gt;\r\n\r\n&lt;p&gt;From the discussion it seems clear that we are interested in both the characteristics of different commute methods and the variability associated with individual commute methods. Statistically, these are two separate problems that can be addressed through data collection and analysis. As part of trying to frame the problem, we iterate through a few different scenarios and questions.&lt;/p&gt;\r\n\r\n&lt;p&gt;One concept that we return to periodically in the discussion is the idea that every question has associated with it a potential intervention. So when I ask “What is the variability in my commute time”, a potential intervention is changing the time when I leave home. Another potential intervention is rescheduling my first meeting of the day. Thinking about questions in terms of their potential interventions can be very useful in prioritizing which questions are most interesting to ask. If the potential intervention associated with a question is something you do not have any control over, then maybe that question is not so interesting for you to ask. For example, if you do not control your own schedule at work, then “rescheduling the first meeting of the day” is not an option for you. However, you may still be able to control when you leave home.&lt;/p&gt;\r\n\r\n&lt;p&gt;With the question “How long does it take to commute by Muni?” one might characterize the potential intervention as “Taking Muni to work or not”. However, if Muni breaks down, then that is out of your control and you simply cannot take that choice. A more useful question then is “How long does it take to commute when I &lt;em&gt;choose&lt;/em&gt; to take Muni?” This difference may seem subtle, but it does imply a different analysis and is associated with a potential intervention that is completely controllable. I may not be able to take Muni everyday, but I can definitely &lt;em&gt;choose&lt;/em&gt; to take it everyday.&lt;/p&gt;\r\n\r\n&lt;p&gt;The last point I want to make here is related to the concept of taking a personal interest in a problem. If a data analysis problem can be framed in such a manner that it becomes more personally interesting, then perhaps that’s the route that should be taken, if at all possible. Often, we think that there is &amp;ldquo;one true way&amp;rdquo; to approach a problem or ask a question. But usually, there are a variety of different approaches and you should try to take the one that seems most interesting &lt;em&gt;you&lt;/em&gt;.&lt;/p&gt;\r\n\r\n&lt;h3 id=\"fixed-and-random-variation\"&gt;Fixed and Random Variation&lt;/h3&gt;\r\n\r\n&lt;p&gt;Deciding what is fixed variation and what is random is important at the design stage because it can have implications for data collection, data analysis, and the usefulness of the results. Sometimes this discussion can get very abstract, resulting in questions like &amp;ldquo;What is the meaning of ‘random’?”. It’s important not to get too bogged down in philosophical discussions (although the occasional one is fine). But it’s nevertheless useful to have such a discussion so that you can properly model the data later.&lt;/p&gt;\r\n\r\n&lt;p&gt;Classifying everything as &amp;ldquo;random&amp;rdquo; is a common crutch that people use because it gives you an excuse to not really collect much data. This is a cheap way to do things, but it also leads to data with a lot of variability, possibility to the point of not even being useful. For example, if we only collect data on commute times, and ignored the fact that we have multiple commute methods, then we might see a bimodal distribution in the commute time data. But that mysterious bi-modality could be explained by the different commute methods, a &lt;em&gt;fixed&lt;/em&gt; effect that is easily controlled. Taking the extra effort to track the commute method (for example, via Hilary’s text message approach) along with the commute time could dramatically reduce the residual variance in the data, making for more precise predictions.&lt;/p&gt;\r\n\r\n&lt;p&gt;That said, capturing every variable is often not feasible and so choices have to made. In this example, Hilary decided not to track whether she wandered into Walgreens or not because that event did have a random flavor to it. Practically speaking, it would be better to account for the fact that there may be an occasional random excursion into Walgreens rather than to attempt to control it every single time. This choice also simplifies the data collection system.&lt;/p&gt;\r\n\r\n&lt;h3 id=\"sketch-models\"&gt;Sketch Models&lt;/h3&gt;\r\n\r\n&lt;p&gt;When considering what to do with the data once we had it, it turned out that mitigating the worst case scenario was a key consideration. This translated directly into a statistical model that potentially had heavy tails. At this point, it wasn’t clear what that distribution would be, and it isn’t clear whether the data would be able to accurately inform the shape of the tail’s distribution.  That said, with this statistical model in mind we can keep an eye on the data as they come in and see how they shape up. Further, although we didn’t got through the exercise, it could be useful to estimate how many observations you might need in order to get a decent estimate of any model parameters. Such an exercise cannot really be done if you don’t have a specific model in mind.&lt;/p&gt;\r\n\r\n&lt;p&gt;In general, having a specific statistical model in mind is useful because it gives you a sense of &lt;em&gt;what to expect&lt;/em&gt;. If the data come in and look substantially different from the distribution that you originally considered, then that should lead you to ask &lt;em&gt;why do the data look different&lt;/em&gt;? Asking such a question may lead to interesting new details or uncover aspects of the data that hadn’t been considered before. For example, I originally thought the data could be modeled with a Gamma distribution. However, if the data came in and there were many long delays in Hilary’s commute, then her log-Normal distribution might seem more sensible. Her choice of that distribution from the beginning was informed by her knowledge of public transport in San Francisco, about which I know nothing.&lt;/p&gt;\r\n\r\n&lt;h2 id=\"summary\"&gt;Summary&lt;/h2&gt;\r\n\r\n&lt;p&gt;I have spoken with people who argue that are little in the way of generalizable concepts in data analysis because every data analysis is uniquely different from every other. However, I think this experience of observing myself talk with Hilary about this small example suggests to me that there are some general concepts. Things like gauging your personal interest in the problem could be useful in managing potential resources dedicated to an analysis, and I think considering fixed and random variation is important aspect of any data analytic design or analysis. Finally, developing a sketch (statistical) model before the data are in hand can be useful for setting expectations and for setting a benchmark for when to be surprised or skeptical.&lt;/p&gt;\r\n\r\n&lt;p&gt;One problem with learning data analysis is that we rarely, as students, get to observe the thought process that occurs at the early stages. In part, that is why I think many call for more experiential learning in data analysis, because the only way to see the process is to do the process. But I think we could invest more time and effort into recording some of these processes, even in somewhat artificial situations like this one,  in order to abstract out any generalizable concepts and advice. Such summaries and abstractions could serve as useful data analysis texts, allowing people to grasp the general concepts of analysis while using the time dedicated to experiential learning for studying the unique details of their problem.&lt;/p&gt;&lt;img src=\"http://feeds.feedburner.com/~r/SimplyStatistics/~4/4EbVtQc0tsA\" height=\"1\" width=\"1\" alt=\"\"/&gt;</description>\r\n    <feedburner:origLink>https://simplystatistics.org/2019/01/09/how-data-scientists-think-a-mini-case-study/</feedburner:origLink></item>\r\n    \r\n  </channel>\r\n</rss>\r\n"), 
    date = structure(1582560018, class = c("POSIXct", "POSIXt"
    ), tzone = "GMT"), times = c(redirect = 0, namelookup = 3.8e-05, 
    connect = 4e-05, pretransfer = 7.4e-05, starttransfer = 0.310586, 
    total = 0.311412)), class = "response")

Try the tidyRSS package in your browser

Any scripts or data that you put into this service are public.

tidyRSS documentation built on March 7, 2023, 5:23 p.m.