README.md

Suppose you are in a circadian genetics team with specialization in statistical computing.

The team leader gives you some challenges.

Surprisingly, they are all about COMMUNICATIONS.

The leader is interested in seeing how you are going to derive your solutions.

Challenge 1: how would your data communicate well with teammates so that your data can inspire them to choose wise and insightful scientific questions?

We may translate this question into: "Even a single data resource holds plenty of information. How do we find the direction that leads us to the right information?"

Here is a mini solution that takes the public FlyMet database as an example. I highly recommend opening this mini solution on macOS or on an iPad. I tried it on Windows 10, but the fly images were not displayed.

Challenge 2: how would you help teammates draw graphs for circadian genetics research that communicate well with readers?

To solve this problem, it may be instructive to look at what one of the worst genetic graphs looks like. Some critical scientists list Figure 1 ("Δp plotted versus the susceptibility-allele frequency for patients") of Wittke-Thompson JK, Pluzhnikov A, Cox NJ (2005) Rational inferences about departures from Hardy-Weinberg equilibrium. American Journal of Human Genetics 76:967-986 as one of the top 10 worst graphs. Why do they question this Figure 1?

Bonus - How to display data badly?

Challenge 3: everyone in my team is used to GraphPad (which requires no programming). How would you communicate with teammates so that they use a programming language to draw all the statistical graphs for circadian genetics research?

GraphPad was built to make plotting smooth for scientists without requiring any programming. It is easy and attractive, with "Export Publication-Quality Graphs With One Click".

Thanks for your challenge. This one is really tough, because it indirectly asks the respondent to do something ahead of the industry leader. GraphPad is a leading and excellent product, as its many "One Click" slogans say.

When we use only mouse clicks and have a template at hand, drawing a graph like Figure 1 should be easy in GraphPad. Before the template exists, however, we may have to click many times to set parameters such as file type, resolution, transparency, dimensions, and color space (RGB/CMYK) in order to create it.

Figure 1. Total sleep time in Drosophila.

When we hope to draw a figure made up of 12 customized subplots like Figure 2, purely using mouse clicks to accomplish this graphical representation task may take more time, not to mention combining different types of subplots (boxplot variant and cumulative relative frequency distribution plot in this case) into one figure like Figure 3.

Figure 2. Sleep time in Drosophila.

Figure 3. Sleep time in Drosophila.
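To make the contrast with clicking concrete, here is a minimal base R sketch of a 12-panel figure that mixes boxplots with cumulative relative frequency plots. It is not the code behind Figures 2 and 3: the group names, the simulated sleep times, and the output file name are all assumptions made for illustration.

```r
# Hypothetical, simulated sleep times in minutes. These are NOT the data
# behind Figures 2 and 3; they only give the panels something to draw.
set.seed(1)
groups <- paste0("G", 1:6)
sleep <- lapply(groups, function(g) rnorm(30, mean = 600, sd = 80))
names(sleep) <- groups

# Write the figure to a temporary PDF (file name is an assumption).
out_file <- file.path(tempdir(), "sleep_panels.pdf")
pdf(out_file, width = 12, height = 9)

# A 3 x 4 grid gives 12 panels: one boxplot and one cumulative
# relative frequency plot for each of the 6 groups.
par(mfrow = c(3, 4), mar = c(4, 4, 2, 1))
for (g in groups) {
  boxplot(sleep[[g]], main = paste(g, "boxplot"), ylab = "Sleep (min)")
  plot(ecdf(sleep[[g]]), main = paste(g, "cumulative"),
       xlab = "Sleep (min)", ylab = "Cumulative relative frequency")
}
invisible(dev.off())
```

Once such a loop exists, redrawing all 12 panels for a new data set means rerunning the script rather than repeating the clicks.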

We are accustomed to, and heavily dependent on, mouse clicking in our daily work. When we have to draw plots for differential gene expression analysis like Figure 4 by coding (unfortunately, the DESeq2 package in R has no perfect match in GraphPad), we may spend a considerable amount of time learning a programming platform: how to install it, what each piece of syntax means, which one we should use, why error messages pop up, where to find the solution, and so on. We are not used to switching quickly to programming after being immersed in GraphPad. Even worse, unlike mouse clicking, coding is forgotten rapidly if we do not practice it often. So the next time we have to switch from GraphPad to a programming language to draw statistical graphs, we have to take time again to refresh our coding knowledge. When we rely on GraphPad, we need to account for the time cost of switching tools. In this situation, we may wonder: is it possible to draw all statistical graphs using one tool?

Figure 4. RNA-seq reveals changes in gene expression during chronic social isolation in humans.

Moreover, suppose that completing one plot takes ten clicks on average, that the average interval between two clicks is one second (including the time for cursor movement), and that we run 1000 wet experiments, each corresponding to one plot. We then have to click 10000 times to draw all 1000 plots. The ideal way to complete 1000 plots is to draw them in one sitting without any distraction. This means sitting in front of a computer for ~2.8 hours (10000 seconds), focusing solely on clicking the right menu or icon and moving the mouse to the right place. We are not machines. It is hard to keep our minds from wandering when we repeat the same action. Further, with repeated mouse activity, our fingers, eyes, brains, and even lumbar vertebrae may gradually start to hurt, which can negatively affect our research motivation. For high-volume plotting, GraphPad’s strong affinity for mouse clicking (and lack of affinity for programming) in fact steals our time, health, and motivation.
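The estimate above is plain arithmetic. A minimal R sketch, using only the figures assumed in the paragraph (the variable names are illustrative), makes it easy to rerun with other assumptions:

```r
# Assumptions taken from the paragraph above.
plots             <- 1000  # one plot per wet experiment
clicks_per_plot   <- 10    # average clicks to finish one plot
seconds_per_click <- 1     # average interval between clicks

total_clicks <- plots * clicks_per_plot
total_hours  <- total_clicks * seconds_per_click / 3600

total_clicks           # 10000
round(total_hours, 1)  # 2.8
```

Doubling the clicks per plot or the interval between clicks doubles the total time.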

We may wonder how much we want programming to achieve. Could team members produce a plot like Figure 3 automatically and quickly, even without entering a single parameter? Could they avoid the constant clicking and mouse movement, and avoid wading through help documents that describe the usage of different packages and programming platforms?

Type the following code (shown with a gray background) into your R/RStudio console.

install.packages("devtools")

library(devtools)

install_github("anniliu1/communications")

library(communications)

?boxplot_sleep

The help document should pop up after you type ?boxplot_sleep in your console.

Scroll down that help page and you will see the Examples section with several code snippets.

Copy and paste all the code snippets into your console and press Enter/Return on your keyboard. Let us see you produce your own Figure 3 in ~30 seconds.

When you read this line, please stand up from your computer desk, take your eyes off the screen, and rest for at least 5 minutes. Your health is more important than finishing this document. Thanks for your cooperation.

------


Challenge 4: how would you communicate with teammates majoring in biology and chemistry if they are curious about something involving statistics?

For example, they ask the following question: Suppose a sleep disorder called XYZ is a genetic disease caused by defects in the CRY1 gene. Each individual has two copies of the CRY1 gene, one from the father and one from the mother. Each copy of the CRY1 gene can be dominant or recessive. An individual can have (a) two copies of the dominant version (the individual does not have the disease and cannot transmit it), (b) one copy of the dominant version and one copy of the recessive version (the individual does not have the disease, but is a carrier and can transmit it to progeny), or (c) two copies of the recessive version (the individual has the disease and can transmit it to progeny). For any given child, each parent passes on one of his/her two copies of the CRY1 gene, and each copy has a 0.5 probability of being passed on. They want to know: what is the probability that a child has the XYZ disease if one parent is a carrier and the other parent has the XYZ disease?

To solve this problem, it may be interesting to see how to confuse our teammates by using statistical/mathematical jargon.

The first bad way is:

The child will have the XYZ disease if he/she receives two recessive genes. This is Pr(a recessive gene from the parent who is the carrier and a recessive gene from the parent with the disease). These are independent events, so this equals Pr(a recessive gene from the parent who is the carrier) × Pr(a recessive gene from the parent with the disease) = 0.5 × 1.0 = 0.5.

The second bad way is:

Notation: father: f; mother: m; dominant version: d; recessive version: r.
The sample space is [(f_d, m_r), (f_d, m_r), (f_r, m_r), (f_r, m_r), (f_r, m_d), (f_r, m_r), (f_r, m_d), (f_r, m_r)].
Therefore, Pr(child has the disease) = 4/8 = 1/2.

The good way is: use the Punnett Squares that teammates may prefer to use.

Notation: dominant version: d; recessive version: r

|   | r  | r  |
|---|----|----|
| d | dr | dr |
| r | rr | rr |
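For teammates who want to check the square themselves, here is a minimal R sketch that builds the same Punnett square and counts the affected genotypes. The variable names are illustrative assumptions; the alleles and the 0.5 transmission probability come from the question.

```r
# Alleles each parent can pass on (each with probability 0.5).
carrier  <- c("d", "r")  # carrier parent: one dominant, one recessive copy
affected <- c("r", "r")  # parent with the disease: two recessive copies

# outer() pairs every carrier allele (rows) with every allele of the
# affected parent (columns), which is exactly the Punnett square.
square <- outer(carrier, affected, paste0)
dimnames(square) <- list(carrier, affected)
print(square)

# The four cells are equally likely, so the probability of disease is
# the share of cells with two recessive copies.
p_disease <- mean(square == "rr")
p_disease  # 0.5
```

Two of the four equally likely cells are rr, so the probability that a child has the XYZ disease is 0.5.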

Challenge 5: if you are new to the ChIP sequencing field and teammates assign you to attend a meeting where several companies present their sequencing techniques, how would you listen to the sales representatives? Your solution is expected to be transferable to any other sequencing technique, not limited to ChIP.

A sales representative from the alpha company of the largest scale in the ChIP field says, 'Our ChIP sequencing technique is an unmatched approach to detect the interaction between genome-wide DNA and transcription factors. With our industry-leading data quality and integrated workflows that simplify the whole sequencing process from library preparation to downstream data analysis, we are passionate about bringing our expertise to your research projects'.

Recall that profit is the essence of business. That means there should be some differences between what a sales representative says about his product and the objective truth of his product.

First, listen thoroughly to what experienced scientists say about the critical limiting factors of the ChIP sequencing technique. Second, listen thoroughly to what the sales representatives of alpha and its competitors say about how their ChIP sequencing techniques deal with those critical limiting factors. Third, based on what we have heard, think carefully about which quantitative indices can replace the adjectives in the pitch (such as "unmatched" and "industry-leading"). Finally, if, after several rounds of assessment, alpha’s product indeed shows the highest quality among the competitors, though not perfect, we may accept it, but we should take more time to look closely at what the sequencing hardware and software look like on the inside and estimate how likely we are to advance alpha’s product. If the chance of making technical advances within 5 years is over 0.50, and there is a strong possibility of revealing new things when we revisit old questions with new techniques, then make advances now without hesitation.

Challenge 6: how would you help the team communicate with the research committee, so that the committee wants to invest in the team's research plan?

ABC University raises $20 million to help care for susceptible individuals and patients with any adverse events and disorders. The research committee aims to fund promising research that translates basic science findings into clinical patient care. Professor DEF previously led a seminal Drosophila and rat study showing that the interaction between CRY c.xxx A>C and DBT c.xxx A>G is negatively associated with the occurrence of the feeling of being tired and slightly confused when there is an evident difference in the time. The study estimates that as many as 3 in 1000 people of European descent carry at least one copy of these two variants. Professor DEF guesses that the genetic evidence shown in animals applies to the general human population, and he plans a human genetic study to support his guess. What kind of study design sounds rational in the eyes of the research committee? And supposing the frequency of having both variants in people of European descent is 0.3%, what is the smallest number of subjects that professor DEF should sample so that he has at least a 90% chance of having at least one subject with the target two variants in his sample?

The mini solution:

- A case-control study may be appropriate. When there is an evident difference in the time, the feeling of being tired and slightly confused (control) is prevalent, but having no feeling of being tired and slightly confused (case) may not be prevalent. Also, compared with finding people who have no such feeling at all when transferring to another time zone, it may be easier to find people whose feeling is less intense. Considering this situation, we may further define the feeling of being tired and slightly confused as an ordinal outcome instead of a binary outcome.
- Professor DEF may sample at least 767 subjects, so that he has at least a 90% chance that his sample contains at least one subject with both CRY c.xxx A>C and DBT c.xxx A>G. The number 767 can be derived from the binomial probability mass function, where having the target two variants is regarded as a success and 0.3% is the probability of success.

Again, it is not good to mention statistical terminology in front of geneticists and biologists. Some software and online resources output the sample size automatically, asking users only to click or type parameters such as minimum detectable difference, power, prevalence of historical value (group 1), significance level, and sidedness. Unlike with plotting, it may be worse to act like those tools and hand you an automated package, code, or interface for the sample size calculation, because that works against understanding why choosing a specific number of subjects will or will not deliver the result you want.

This sample size question can be restated as: "What is the smallest number of subjects that professor DEF should sample so that he has at most a 10% chance of having no subject with the target two variants in his sample?"
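To show where 767 comes from rather than treating it as a black box, here is a minimal R sketch of the reasoning. It assumes subjects are sampled independently and each carries both variants with probability 0.003; the variable and function names are illustrative.

```r
p      <- 0.003  # frequency of carrying both variants (from the question)
target <- 0.90   # required chance of at least one such subject

# Chance of at least one carrier among n subjects is 1 - P(X = 0),
# where X ~ Binomial(n, p) and P(X = 0) = (1 - p)^n.
p_at_least_one <- function(n) 1 - dbinom(0, size = n, prob = p)

# Scan candidate sample sizes and keep the smallest one that reaches the target.
n_grid <- 1:2000
n_min  <- min(n_grid[p_at_least_one(n_grid) >= target])
n_min  # 767

# The same answer from solving (1 - p)^n <= 1 - target for n.
n_closed <- ceiling(log(1 - target) / log(1 - p))
n_closed  # 767
```

With 766 subjects the chance of at least one carrier falls just short of 90%, which is why 767 is the minimum.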

Challenge 7: This is the final challenge. If you are given a chance to improve your solution for the first challenge, what will you do?

Recap the first challenge: how would your data communicate well with teammates so that your data can inspire them to choose wise and insightful scientific questions?

This question comes down to our tastes. It comes down to exposing ourselves to the best things that humans have done in diverse areas (athletics, history, sociology, psychology, literature, music, biology, chemistry, statistics, programming, etc.) and trying to bring these things into data as a medium to express our feelings and vision.



anniliu1/communications documentation built on Jan. 2, 2022, 7:20 p.m.