\newline

For drawing the diagram, I used 'ggalluvial' package which is an extension package of 'ggplot2'. Below code loads packages used in my code.

library(ggplot2)
library(ggalluvial)

  \newline

The initial shape of the data is as below.

print(read.table("https://kimdaeh.github.io/ASSIGNMENT/my_data.txt", stringsAsFactors=F, sep="\t") )

  \newline

I reformat the data into the long format that is easy to handle for the packages ggplot and ggalluvial. Also, the reformatted data will contain some variables that will be required for adjusting the shape of the diagram in the wanted form.

reform <- function (data) {     
    year <- c( rep("1990" , 5) , rep("1995" , 5) , rep("2000" , 5) , rep("2005" , 5) , rep("2010" , 5) )
    condition <- rep( colnames(data)[2:6] , 5 )
    value <- c( data[1 , ][2 : 6], data[2 , ][2 : 6], data[3 , ][2 : 6], data[4 , ][2 : 6], data[5 , ][2 : 6] ) 
    value <- unlist( value )

# List the name of the disease in the order of the value of the fraction as of 1990. (In order to make y axis label)
    ordering0 <- rbind( rank( data[1 , c(2:6)]), rank( data[2 , c(2:6)]), +
               rank( data[3 , c(2:6)]), rank( data[4 , c(2:6)]), rank(data[5, c(2:6)]) )
    ordering <- c( ordering0[1 , ], ordering0[2 , ], ordering0[3 , ], ordering0[4 , ], ordering0[5 , ] )
    data.for.graph <- data.frame( year , condition , value, ordering)

# fakename: I found that the order of layers (flow) drawn on the diagram is in the alphabetical order of the disease name. However, the diagram in the Excel file is not so. Therefore, I reorder the sequence of the disease in accordance with the order in the original diagram. 
    temp = data.for.graph$condition
    fakename = ifelse( temp=="Obesity", "A", 
                ifelse (temp=="Hypercholesterolemia", "B",
                    ifelse (temp=="Diabetes", "C",
                        ifelse (temp=="Smoking", "D", "E"))))
    data.for.graph <- data.frame(data.for.graph, fakename)    

# fakevalue: The height of the stratum for a disease which has very small value of fraction is too short so the diagram drawn does not display such disease (e.g., Obesity in 1990). Therefore, I adjusted the value of fraction of each disease in a way the plotted diagram can display it.
    data.for.graph$fakevalue = ifelse (data.for.graph$value<=0.001, 0.025, data.for.graph$value)
    data.for.graph$fakevalue = ifelse (data.for.graph$value<0.02 & data.for.graph$value>=0.01, +
                0.03, data.for.graph$fakevalue)

    data.for.graph
}
my_data <- read.table("https://kimdaeh.github.io/ASSIGNMENT/my_data.txt", stringsAsFactors=F, 
                 sep="\t")
data.for.graph <- reform(my_data)
head(data.for.graph, 5)

Explanation: (1) The diagram will use the column 'fakevalue' for determining the height of each stratum. (2) The flow (alluvium) will be drawn sequentially in the order of the alphabets in the column 'fakename'. (3) The y label will be added in the order of the numbers in the column 'ordering'.

  \newline

This function takes in the original data file and produce four versions of Sankey diagram (With and without the white space separating shapes of the same color from one another, and with and without the white space separating shapes of different colors from one another.)

# (a,b) stands for the version of the diagram (in terms of white lines)
sankeyplot <-function(data, a, b) {     
    data.for.graph <- reform(data)  # reformat the original data according to function 2

# Colors of the each flow and stratum are in accordance with the original diagram in the Excel file ( "Obesity: A" / "Hypercholesterolemia: B" / "Diabetes: C" / " Smoking: D" / " Hypertension: E" )
    cols <- c(A ="#4D6D94", B ="#D88F54", C="#D9654E", D="#7EACC5", E="#8ABA8F")

    p = ggplot( data.for.graph,  aes(x = year, y = fakevalue, 
        fill = reorder(fakename, -ordering), alluvium = fakename , stratum = fakename)) 
    p = p + scale_fill_manual(values = cols)


# Diagram title and other formats
    p = p + theme_bw() + 
        ggtitle("Risk Factors for Stroke in Blacks") + 
        theme( plot.title = element_text(size=15, hjust = 0.5 , margin=margin(0,0,30,0)),
            panel.border = element_blank(),     
            panel.grid.major = element_blank(),     
            panel.grid.minor = element_blank(),     
            legend.position = "none",   
            axis.title=element_blank(), 
            axis.ticks=element_blank(),     
            axis.text.y=element_blank(),
            axis.text.x=element_text(size=13) ) +
        scale_x_discrete(position = "top") 


## Plotting the alluvial diagram (Four cases)
    # [a==0 & b==0] Vertical White (x), Horizontal White (x)
    if (a==0 & b==0) { p = p+geom_alluvium(width = 0.7, knot.pos = 0, alpha = 1, 
            decreasing = F, fill = c( cols, rep(NA,20) ) ) } 

    # [a==1 & b==0] Vertical White (o), Horizontal White (x)
    else if (a==1 & b==0) { p = p+geom_alluvium(width = 0.7, knot.pos = 0, alpha = 1, 
            decreasing = F, fill = c( cols, rep(NA,20) ) )
            x <- c(1990, 1995, 2000, 2005, 2010)
            for (val in x) {
                p = p + geom_bar(width = 0.7, colour = 'white', size =1.5, 
                    position="stack", stat="identity", 
                    data = data.for.graph[data.for.graph$year == val , ]) 
            }

            for (val in x) {
                p = p + geom_bar(width = 0.67, size =1.5, position="stack", 
                    stat="identity", 
                    data = data.for.graph[data.for.graph$year == val , ] )
            } 
    }

    # [a==0 & b==1] Vertical White (x), Horizontal White (o)
    else if (a==0 & b==1) { p = p+geom_alluvium(colour = "white", size = 1.5, 
            width = 0.7, knot.pos = 0, alpha = 1, decreasing = F, fill = c( cols, rep(NA,20) ) ) } 

    # [a==1 & b==1] Vertical White (o), Horizontal White (o)
    else if (a==1 & b==1) { p = p+geom_alluvium(colour = "white", size = 1.5, 
            width = 0.7, knot.pos = 0, alpha = 1, decreasing = F, 
            fill = c( cols, rep(NA,20) ) ) +
            geom_stratum(color="white", size =1.5, width = 0.7, decreasing = F) }


# Adding labels (values) on each stratum (box)
    p = p+ geom_text(aes(label = sprintf( "%0.2f", round(value+0.00001, digits = 2) )), 
        stat = "stratum", color="white", decreasing = F, size=4)


# Draw an invisible bar around y axis to secure the space for adding y axis label
    p = p+ geom_bar(width = 0.8, position="stack", stat="identity", 
        data= data.for.graph[data.for.graph$year == 1990 , ] , aes( x=0.3), alpha = 0 )


# Adding y-axis label
    p = p + geom_text( data= data.for.graph[data.for.graph$year == 1990 , ], 
        aes(label = condition , x=0.4), size = 3.5, color = "black", 
        position = position_stack(vjust = 0.5))


# Error message
    if ( (a!=0 & a!=1) | (b!=0 & b!=1) ) 
        {print("Error: the arguments 'a' and 'b' of the function 'sankey.plot(data, a, b)' should have the value of either 0 or 1.") } 
    else {p}
}

  \newline

In the function sankey.plot(data, a, b), 'a' (0 or 1) is the indicator for the existence of the vertical white lines and 'b' (0 or 1) is the indicator for the existence of the horizontal white lines.

sankeyplot(my_data, 0, 0)
sankeyplot(my_data, 1, 0)
sankeyplot(my_data, 0, 1)
sankeyplot(my_data, 1, 1)

  \newline

If the arguments (a, b) of the function do not have value of either 0 or 1, an error message is displayed.

sankeyplot(my_data, 1.5, 1)


kimdaeh/daehyun.sankey documentation built on March 10, 2020, 12:06 a.m.