R/compareMixtureToData.R

Defines functions compareMixtureToData compareMixtureToDataDirect oneByOnePrediction

Documented in compareMixtureToData compareMixtureToDataDirect oneByOnePrediction

#################################
## 

###############################
#' @title Compare a mixture solution to some data.
#'
#' @description
#' This function takes a mixture solution, a data matrix, and a mapping of data observations into clusters. It then predicts what it expects the mixture solutions to look like.
#'
#' It is essential to understand that there are two key structures being explored simultaneously in these plots. The first is the P-dimensional clustering of the data, which defines the similarities; each of the N data points has a similarity to these P clusters. A representation in this space is called a palette. The second is the K-dimensional set of admixture weights. Each of the K "ancestral" or latent variables also has a P dimensional palette. Further, each of the N data points has a K dimensional "admixture" breakdown, seen as a mixture of the K ancestral palettes.
#'
#' We must be able to match up each individual to the palette which it represents, so that we can order the individuals according to the palette. As such, this implemetation reorders the individuals, and may reorder the clusters themselves, in order to provide a clean representation of the data. If you don't want this, you may try using \code{\link{compareMixtureToDataDirect}}).
#' 
#' @param mix A proposed mixture solution of dimension N by K; for example, as generated by STRUCTURE or ADMIXTURE. 
#' @param dataraw A matrix containing the data of dimension N by P, for example, as generated by ChromoPainter. The data must reflect similarity to P clusterings of the data. Note that we assume that this matrix is correctly normalised (all the rows sum to the same value); if this is not true then the results may be strange. If in doubt, provide dataraw/rowSums(dataraw) to normalise the rows to sum 1.
#' @param fam An N by 2 or more data frame consisting of the fam file that generated the data. The used parts are: column 1: the cluster membership that created the groups in dataraw (with the column names in dataraw as the values). column 2: row names (for both the data and the mix, and in the same order). One of fam or ids must be present.
#' @param ids An N by 3 data frame consisting of: column 1: row names (for both the data and the mix). column 2: the cluster membership that created the groups in dataraw (with the column names in dataraw as the values). column 3: inclusion (0 for absent, 1 for present; NB only all present will currently work!). One of fam or ids must be present.
#' @param gap The spacing between populations, relative to the spacing between individuals which is 1. Default: 3
#' @param remself Number of iterations that "self-copying" (cluster specific sharing of drift) is removed. Set to 10 to essentially remove all self-copying.
#' @param ancestral Ancestral model, i.e. how the latent clusters are defined. There are two options. "mixture": meaning tht the populations are defined as a mixture of the individuals that comprise them according to the admixture model. This is not advisable because it allows ancestry from different true latent popluations to affect inference of others. "solve": find the definition of the clusters that best explains the data by root-mean-sqaure-distance. Default: "solve"
#' @param relabel Mapping from dendrogram labels to individual names. Should not be needed. Default: The identity function
#' @param tdend Dendrogram of the individuals to determine plot order. Must match the order of the individuals in mix and dataraw, and columns of dataraw. Default: NULL, meaning create this from the data
#' @param popdend Dendrogram of the populations to determine plot order. Must match the order of the individuals in mix and dataraw, and columns of dataraw. Default: NULL, meaning create this from the data
#' @param poplist Specify the full ordering of individuals and populations. Must match the order of the individuals in mix and dataraw, and columns of dataraw. Default: NULL, meaning create from the data
#' @param poporder Specify a population ordering. Default: NULL, meaning determine from the dendrogram
#' @param mycols The colour for each of the K ancestral populations; Default: NULL, meaning use rainbow(K). Can be modified in the returned object instead.
#' @param mycols2 The colour for each of the P clusters; Default: NULL, meaning determined using \code{\link{rgbDistCols}} so that similar clusters have similar colours. Can be modified in the returned object instead.
#' @keywords mixture chromopainter admixture structure finestructure
#'
#' @return An object of class admixfs, which is a list containing the following:
#' \itemize{
#'   \item mix The N by K admixture matrix, reordered by clusters.
#'   \item selfmatrix An N by P matrix of containing 0 except if individual i is in cluster j in which case it is 1.
#'   \item data.NbyP The N by P data matrix, reordered by clusters.
#'   \item data.PbyP A P by P matrix describing the similarity of each cluster to each other.
#'   \item K The number of ancestral or latent populations.
#'   \item P The number of clusters, defining the size of the palette.

#'   \item poplist A representation of the membership of each cluster. A list of clusters (reordered) each containing a character vector of the individuals in that cluster.
#'   \item tdend A dendrogram relating the clusters, used to define their order.
#'   \item coancestry.KbyP The palettes of the K ancestral populations.
#'   \item pred.NbyPatK The predicted palette of the data.
#'   \item meanpainting.KbyP The average palette.

#'   \item meandiff.KbyP The prediction - meanpainting
#'   \item meandiff.KbyP.over2A The amount that the mean prediction is over - The amount that the full predicction is over, with negative values set to zero
#'   \item meandiff.KbyP.over The amount that the mean prediction is over - The amount that the full predicction is over
#'   \item meandiff.KbyP.under2A The amount that the mean prediction is under - The amount that the full predicction is under, with negative values set to zero
#'   \item meandiff.KbyP.under The amount that the mean prediction is under - The amount that the full predicction is under

#'   \item same.KbyP The palettes that are the same in the data and the preciction
#'   \item diff.KbyP The prediction - data
#'   \item diff.KbyP.under The underprediction
#'   \item diff.KbyP.over The overprediction

#'   \item dist.KbyP.predfail The absolute error for each individual
#'   \item dist.KbyP=dist.KbyP The sum of the absolute errors

#'   \item tspace The distance of each individual in the plot to its left neighbour, allowing for spaces between populations. (Locations are given by \code{cumsum} of this)
#'   \item popxcentres The centre of each population in the plot
#'   \item mycols The colour for each of the K ancestral populations; Default: rainbow(K)
#'   \item mycols2 The colour for each of the P clusters; Default: determined using \code{\link{rgbDistCols}} so that similar clusters have similar colours.
#'   \item mycols2A mycols2 with white at the start, for when we plot the mean of all the data
#'   \item xrange The range of the x axis for the plot
#' }

#' @export
#' @examples
#' data(arisim_remnants)
#' 
#' ## The vanilla analysis highlights the existance of genetic drift
#' ## specific to Pop5 that isn't captured by the mixture
#' 
#' adm<-compareMixtureToData(arisim_remnants$mixture,
#'                           arisim_remnants$data,
#'                           ids=arisim_remnants$ids)
#' admplot=mixturePlot(adm)
#'
#' ## The same plot but with excess similarity within populations
#' ## removed highlights that the mixture solution overpredicts the
#' ## amount of Pop13 admixture in Pop5, and underpredicts the amount
#' ## of Pop6 admixture in the same population. The same is true for
#' ## Pop13 to a lesser extent. Each other population is well fit- the
#' ## variation seen is all within the clusters.
#' adm_remself<-compareMixtureToData(arisim_remnants$mixture,
#'                                   arisim_remnants$data,
#'                                   ids=arisim_remnants$ids,remself=10)
#' admplot_remself=mixturePlot(adm_remself)
#' 
compareMixtureToData<-function(mix,dataraw,
                               fam=NULL,
                               ids=NULL,
                               remself=0,
                               ancestral="solve",
                               relabel=I,
                               tdend=NULL,
                               popdend=NULL,
                               poplist=NULL,
                               poporder=NULL,
                               mycols=NULL,
                               mycols2=NULL,
                               gap=3){
    ## Make the data
    if(class(dataraw)!="matrix") {
        dataraw=as.matrix(dataraw)
        if(class(dataraw[,1])=="character") stop("dataraw contains characters. It must contain numeric values.")
    }
    if(class(mix)!="matrix"){
        mix=as.matrix(mix)
        if(class(mix[,1])=="character") stop("mix contains characters. It must contain numeric values.")
    }
    if(is.null(rownames(dataraw))){
        stop("No row names in dataraw. Since these are provided by mixPainter, they are expected to be set and are important for matching data.")
    }
    if(is.null(rownames(mix))){
        warning("No rownames provided in mix. Assuming that they match those in dataraw. If this is not true, you must set them manually.")
        rownames(mix)=rownames(dataraw)
    }
    if(is.null(fam) && is.null(ids)) stop("Must provide fam or ids file!")
    if(is.null(ids)){## Make a legitimate ids object from the fam
        if(length(unique(fam[,1]))==length(fam[,1])) {tgrp=2;tid=1
        }else if(length(unique(fam[,2]))==length(fam[,2])) { tgrp=1;tid=2
        }else stop("Invalid fam file provided!")
        ids=data.frame(id=as.character(fam[,tid]),group=as.character(fam[,tgrp]),retain=1,stringsAsFactors=FALSE)
    }
    ## Cluster by IDs
    tpops<-unique(ids[,2])
    poplistunsrt<-lapply(tpops,function(x){
        ids[ids[,2]==x,1]
    })
    names(poplistunsrt)<-tpops

    if(dim(dataraw)[1] == dim(dataraw)[2]){ ## Given a square matrix
        ## Create the data matrix using the ids
        rawdata.NbyP = matColSums(dataraw,poplistunsrt)
    }else{
        ## ids should reflect the clustering in the columns
        rawdata.NbyP = dataraw
    }

    if(is.null(popdend)){
        tdat<-t(rawdata.NbyP) ## Renormalise each column (as a row)
        tdat<-tdat/rowSums(tdat)
        popdend<-as.dendrogram(hclust(dist(log(tdat+0.001))))

    }
    if(is.null(poporder)){
        poporder<-labels(popdend)
    }
    if(is.null(tdend) &&is.null(poplist)){
        tdend<-popDendRelabelMembers(popdend,ids,relabel)
        poplist<-poplistunsrt[poporder]
    }else if(is.null(tdend)){
        tdend<-popDendRelabelMembers(popdend,ids,relabel)
    }else if(is.null(poplist)){
        poplist<-strsplit(labels(tdend),";")
    }
    indorder<-as.character(unlist(poplist))
        
    rawdata.NbyP<-rawdata.NbyP[,poporder]
    
    mix<-mix[indorder,]
    data.NbyP<-rawdata.NbyP[indorder,]

        ## Make the K by P matrix
    K<- dim(mix)[2]
    P<- dim(data.NbyP)[2]
    
    data.PbyP<-aggregrateDataForK(data.NbyP,tdend,P)
    
    compareMixtureToDataDirect(mix=mix,
                                     data.NbyP=data.NbyP,
                                     poplist=poplist,
                                     remself=remself,
                                     ancestral=ancestral,
                                     data.PbyP=data.PbyP,
                                     tdend=tdend,
                                     mycols=mycols,
                               mycols2=mycols2,
                               gap=gap)
}

#' @title Compare a mixture solution to some data, without reordering or changing any of the data.
#'
#' @description
#' This function takes a mixture solution, a data matrix, and a mapping of data observations into clusters. It then predicts what it expects the mixture solutions to look like.
#'
#' This function assumes that the data are ordered according to the clusters, and the clusters are ordered according to the palette. Arranging this is non-trivial so you may wish to use the version of this code that does this for you, \code{\link{compareMixtureToData}}). You can then easily call this function by reordering rows/columns of the data and the mixture, as shown in the example.
#' 
#' @param mix A proposed mixture solution of dimension N by K; for example, as generated by STRUCTURE or ADMIXTURE. 
#' @param data.NbyP A matrix containing the data of dimension N by P, for example, as generated by ChromoPainter. the data must reflect similarity to P clusterings of the data. 
#' @param adm The result of a call to compareMixtureToData. This is used to collect the plotting properties below (poplist, tdend, mycols, mycols2, gap)
#' @param poplist A list of length P, with elements containing the rownames of the above two matrices which were used in constructing the data column p in data.NbyP (a transformation of the ids give to \code{\link{compareMixtureToData}}).
#' @param gap The spacing between populations, relative to the spacing between individuals which is 1. Default: 3
#' @param remself Number of iterations that "self-copying" (cluster specific sharing of drift) is removed. Set to 10 to essentially remove all self-copying.
#' @param ancestral Ancestral model, i.e. how the latent clusters are defined. There are two options. "mixture": meaning that the populations are defined as a mixture of the individuals that comprise them according to the admixture model. This is not advisable because it allows ancestry from different true latent popluations to affect inference of others. "solve": find the definition of the clusters that best explains the data by root-mean-sqaure-distance. Default: "solve"
#' @param data.PbyP =NULL, # Needed for colours only
#' @param tdend Dendrogram of the populations to determine plot order. Default: NULL, meaning create this from the data
#' @param mycols The colour for each of the K ancestral populations; Default: NULL, meaning use rainbow(K). Can be modified in the returned object instead.
#' @param mycols2 The colour for each of the P clusters; Default: NULL, meaning determined using \code{\link{rgbDistCols}} so that similar clusters have similar colours. Can be modified in the returned object instead.
#' @keywords mixture chromopainter admixture structure finestructure
#'
#' @return An object of class admixfs ; see \code{\link{compareMixtureToData}}.
#'
#' @export
#' @examples
#' \dontrun{
#' data(arisimsmall)
#' ## Example where we reorder the populations manually
#' adm0<-compareMixtureToData(arisimsmall$mixture,arisimsmall$data,ids=arisimsmall$ids)
#' mypoplist=adm0$poplist[paste0("Pop",c(13,5:7,1:4,9,11,12))]
#' mymix=adm0$mix[unlist(mypoplist),]
#' mydata=adm0$data.NbyP[unlist(mypoplist),names(mypoplist)]
#' adm<-compareMixtureToDataDirect(mymix,mydata,mypoplist)
#' gpref=mixturePlot(adm,cex.names=0.5,residual.scale=0.2,height.cluster=.4)
#' }

compareMixtureToDataDirect<-function(
    mix,
    data.NbyP,
    poplist,
    remself=0, 
    ancestral="solve",
    adm=NULL, 
    data.PbyP=NULL, # Needed for colours only
    tdend=NULL, # Not obviously used
    mycols=NULL,
    mycols2=NULL,
    gap=3) # Reported for completeness
{
    K<- dim(mix)[2]
    P<- dim(data.NbyP)[2]

    if(!is.null(adm)){
        poplist=adm$poplist
        tdend=adm$tdend
        mycols=adm$mycols
        mycols2=adm$mycols2
        gap=adm$gap
    }
    
    if(any(unlist(poplist)!=rownames(mix))){
        stop("Populations are not ordered according to poplist!")
    }
    if(any(rownames(data.NbyP)!=rownames(mix))) {
        stop("Populations are not ordered the same in mix and data.NbyP!")
    }
    if(any(names(poplist)!=colnames(data.NbyP))){
        cat("Warning: poplist is not ordered as the columns of data.NbyP. This will result in palettes that do not match. Consider reordering.")
    }
    
    tpsep<-sapply(poplist,length)
    tspace<-rep(0,dim(mix)[1])
    tspace[1+cumsum(tpsep)]<-gap
    tspace<-tspace[-length(tspace)]
    xrange<-c(0,sum(1+tspace))
    popxcentres<-cumsum(tpsep+gap)-tpsep/2-gap
    

        if(is.null(data.PbyP)) data.PbyP=t(matColMeans(t(data.NbyP),poplist))
##### Compute everything that we need for the plot
    if(ancestral=="mixture"){
        mix.colnorm<-t(t(mix)/colSums(mix))
        coancestry.KbyP<-t(mix.colnorm) %*% data.NbyP
    }else if(ancestral=="solve"){
    ## Let alpha * B = C
    ## Where alpha is the N by K matrix of admixture proportions
    ## and B is the K by P matrix of the ancestral coancestry vectors
    ## and C is the N by P matrix of the current coancestry vectors
    ## Then: (alpha^T alpha) B = alpha^T C
    ## (alpha^T alpha)^-1 (alpha^T alpha) B = (alpha^T alpha)^-1 alpha^T C
    ## B = (alpha^T alpha)^-1 alpha^T C
    ##### NOTE: No guarantee of positivity!
        alphatalpha<-t(mix) %*% mix
        alphatalphainf<-solve(alphatalpha)
        coancestry.KbyP<-alphatalphainf %*% t(mix) %*% data.NbyP
    }else{
        stop(paste("Unimplemented ancestral model",ancestral))
    }
    
    pred.NbyPatK<-mix %*% coancestry.KbyP
    if(remself>0){
        ## coancestry.KbyP
        for(i in 1:remself){
            data.NbyP<-oneByOnePrediction(mix,data.NbyP,coancestry.KbyP,pred.NbyPatK,
                                          clusterlist=poplist,ancestral=ancestral)
            coancestry.KbyP<-alphatalphainf %*% t(mix) %*% data.NbyP
            pred.NbyPatK<-mix %*% coancestry.KbyP
        }
    }
    
    meanpainting.KbyP<- colMeans(pred.NbyPatK)
    
    meandiff.KbyP<-t(t(pred.NbyPatK) - meanpainting.KbyP)
    meandiff.KbyP.over<-meandiff.KbyP
    meandiff.KbyP.under<- - meandiff.KbyP
    meandiff.KbyP.over[meandiff.KbyP.over<0]<-0
    meandiff.KbyP.under[meandiff.KbyP.under<0]<-0
    
    same.KbyP<-data.NbyP
    for(i in 1:dim(same.KbyP)[1]) for (j in 1:dim(same.KbyP)[2]) {
        same.KbyP[i,j]<-min(pred.NbyPatK[i,j],data.NbyP[i,j])
    }

    diff.KbyP<-pred.NbyPatK - data.NbyP
    diff.KbyP.over<-diff.KbyP
    diff.KbyP.under<- - diff.KbyP
    diff.KbyP.over[diff.KbyP.over<0]<-0
    diff.KbyP.under[diff.KbyP.under<0]<-0
    
    dist.KbyP<-rowSums(abs(diff.KbyP))
    
#########
    meandiff.KbyP.under2<-meandiff.KbyP.under - diff.KbyP.under
    meandiff.KbyP.over2<-meandiff.KbyP.over - diff.KbyP.over
    meandiff.KbyP.under2[meandiff.KbyP.under2<0]<-0
    meandiff.KbyP.over2[meandiff.KbyP.over2<0]<-0
    
    meandiff.KbyP.under2A<-
        cbind(rowSums(meandiff.KbyP.under)-rowSums(meandiff.KbyP.under2),
              meandiff.KbyP.under2)
    meandiff.KbyP.over2A<-
        cbind(rowSums(meandiff.KbyP.over)-rowSums(meandiff.KbyP.over2),
              meandiff.KbyP.over2)

    dist.KbyP.predfail<-(abs(meandiff.KbyP.over2A[,1])+abs(meandiff.KbyP.under2A[,1]))
    
#########

## Find out who everyone is
    popin.K<-as.numeric(sapply(rownames(mix),function(x){
        which(sapply(poplist,function(y)x%in%y))
    }))
    
    selfmatrix<-data.NbyP
    selfmatrix[]<-0
    for(i in 1:length(popin.K)) selfmatrix[i,popin.K[i]]<-1
    ## FINISH
    
    ## Make the plot
    if(is.null(mycols)) mycols=rainbow(K)
    if(is.null(mycols2)) mycols2<-rgbDistCols(data.PbyP)
    mycols2A<-c("#000000",mycols2)
    

    ret<-list(mix=mix,
              selfmatrix=selfmatrix,
              data.NbyP=data.NbyP,
              data.PbyP=data.PbyP,
              K=K,
              P=P,

              poplist=poplist,
              tdend=tdend,
              
              coancestry.KbyP=coancestry.KbyP,
              pred.NbyPatK=pred.NbyPatK,
              meanpainting.KbyP=meanpainting.KbyP,
              
              meandiff.KbyP=meandiff.KbyP,
              meandiff.KbyP.over2A=meandiff.KbyP.over2A,
              meandiff.KbyP.over=meandiff.KbyP.over,
              meandiff.KbyP.under2A=meandiff.KbyP.under2A,
              meandiff.KbyP.under=meandiff.KbyP.under,
              
              same.KbyP=same.KbyP,
              diff.KbyP=diff.KbyP,
              diff.KbyP.under=diff.KbyP.under,
              diff.KbyP.over=diff.KbyP.over,
              
              dist.KbyP.predfail=dist.KbyP.predfail,
              dist.KbyP=dist.KbyP,
              
              tspace=tspace,
              popxcentres=popxcentres,
              mycols=mycols,
              mycols2=mycols2,
              mycols2A=mycols2A,
              xrange=xrange)
    class(ret)<-"admixfs"
    return(ret)
}

#' @title Obtain a prediction for the ancestral clusters, removing drift
#'
#' @description
#' Obtain the prediction of a mixture for each cluster (a P by P matrix), and return a matrix of this dimension but with replace the data for that cluster by its prediction, so that the diagonal is forced to match the prediction. This removes "genetic drift" specific to a given population.
#'
#' 
#' @param mix A proposed mixture solution of dimension N by K; for example, as generated by STRUCTURE or ADMIXTURE. 
#' @param data.NbyP A matrix containing the data of dimension N by P, for example, as generated by ChromoPainter. the data must reflect similarity to P clusterings of the data. 
#' @param coancestry.KbyP A matrix of the cluster averaged data
#' @param pred.NbyPatK A matrix of the cluster predictions
#' @param tdend A dendrohram of the populations, used for ordering. Default: NULL, meaning use clusterlist
#' @param clusterlist A list of populations, as described in \code{\link{compareMixtureToData}}. Default: NULL, meaning use tdend
#' @param ancestral Ancestral model, as described in \code{\link{compareMixtureToData}}. Default: "solve"
#'
#' @return A matrix of dimension P by P
oneByOnePrediction<-function(mix,data.NbyP,coancestry.KbyP,pred.NbyPatK,
                                  tdend=NULL,clusterlist=NULL,ancestral="solve")
{
    if(is.null(clusterlist)){
        if(!is.null(tdend)) {
            clusterlist<-sapply(labels(tdend),function(x)strsplit(x,";")[[1]])
        }else stop("Need to have valid clusterlist or dendrogram from which to construct it")
    }
    
    P<-length(clusterlist)
    tdata.KbyP<-lapply(1:P,function(p){
        trows<-which(rownames(data.NbyP) %in% clusterlist[[p]])
        testdata.NbyP.forp<-data.NbyP
        testdata.NbyP.forp[trows,]<-pred.NbyPatK[trows,]

        if(ancestral=="mixture"){
            mix.colnorm<-t(t(mix)/colSums(mix))
            testcoancestry.KbyP<-t(mix.colnorm) %*% testdata.NbyP.forp
        }else if(ancestral=="solve"){
            alphatalpha<-t(mix) %*% mix
            alphatalphainf<-solve(alphatalpha)
            testcoancestry.KbyP<-alphatalphainf %*% t(mix) %*% testdata.NbyP.forp
        }else{
            stop(paste("Unimplemented ancestral model",ancestral))
        }
        pred.NbyPatK<-mix %*% testcoancestry.KbyP
        ret<-data.NbyP[trows,]
        ret[,p]<-pred.NbyPatK[trows,p]
        return(ret/rowSums(ret))
    })
    
    retunsrt<-do.call("rbind", tdata.KbyP)
    ret<-retunsrt[rownames(data.NbyP),]
    ret
}
danjlawson/badMIXTURE documentation built on Sept. 27, 2019, 9:11 p.m.