Table of contents

  1. Table of contents
  2. Methods
    1. Model
    2. The DeMixT algorithm for deconvolution
      1. Parameter estimation using iterated conditional modes (ICM)
      2. The GSCM approach to improve model identifiability
      3. Simulation study for the GSCM approach
        1. Library loading
        2. Data simulation
        3. Running DeMixT with/without GSCM
        4. Supplementary Figure 1
  3. Data analysis
    1. Installing DeMixT
    2. Summary statistics for performance evaluation
      1. Concordance correlation coefficient (CCC)
      2. Measure of reproducibility
    3. Mixed tissue microarray dataset
      1. Library loading
      2. Data generation and processing
      3. Supplementary Table 7
      4. Running DeMixT
      5. Results summary
        1. Supplementary Tables 3, 4 and Table 1
        2. Supplementary Figure 3
        3. Figure 2a
        4. Supplementary Figure 4
    4. Mixed cell line RNA-seq dataset
      1. Library loading
      2. Data generation and processing
      3. Supplementary Table 7
      4. Running DeMixT
      5. Results summary
        1. Supplementary Tables 5, 6 and Table 1
        2. Supplementary Figure 5
        3. Figure 2b
        4. Supplementary Figure 6
    5. Laser-capture microdissection (LCM) prostate cancer FFPE microarray dataset
      1. Library loading
      2. Data generation and processing
      3. Supplementary Table 7
      4. Running DeMixT
      5. Results summary
        1. Figure 3a
        2. Figure 3b and 3c
        3. Supplementary Figure 7
        4. Figure 3d
        5. Supplementary Figure 8
        6. Supplementary Figure 9
    6. TCGA HNSCC data
      1. Load data
      2. Phase I in our pipeline
      3. Phase II in our pipeline
      4. Phase III in our pipeline
      5. Results summary
        1. Figure 4a
        2. Figure 4b
        3. Figure 4c
        4. Figure 4d
        5. Supplementary Figure 11

Methods

Model

Let \(Y_{ig}\) be the observed raw expression level from clinically derived malignant tumor samples for gene \(g, g = 1, \cdots, G\), and sample \(i, i = 1, \cdots, S\), where \(G\) denotes the total number of probes/genes and \(S\) denotes the number of samples. The observed expression levels for solid tumors can be modeled as a linear combination of raw expression levels from three components:
\[\begin{equation} {Y_{ig}} = \pi _{1,i}N_{1,ig} + \pi _{2,i}N_{2,ig} + (1 - \pi_{1,i} - \pi _{2,i}){T_{ig}} \label{eq:1} \end{equation}\]

Here \(N_{1,ig}\), \(N_{2,ig}\) and \({T_{ig}}\) are the unobserved raw expression levels from each of the three components. We call the two components for which we require reference samples the \(N_1\)-component and the \(N_2\)-component. We call the unknown component the T-component. We let \(\pi_{1,i}\) denote the proportion of the \(N_1\)-component, \(\pi_{2,i}\) denote the proportion of the \(N_2\)-component, and \(1 - \pi_{1,i} - \pi_{2,i}\) denote the proportion of the T-component. We assume that the mixing proportions of one specific sample remain the same across all genes.
Our model allows for one component to be unknown, and therefore does not require reference profiles from all components. A set of samples for \(N_{1,ig}\) and \(N_{2,ig}\), respectively, needs to be provided as input data. This three-component deconvolution model is applicable to the linear combination of any three components in any type of material. It can also be simplified to a two-component model, assuming there is just one \(N\)-component. For application in this paper, we consider tumor (\(T\)), stromal (\(N_1\)) and immune components (\(N_2\)) in an admixed sample (\(Y\)).
Following the convention that \(\log_2\)-transformed microarray gene expression data follow a normal distribution, we assume that the raw measures \(N_{1,ig} \sim LN({\mu _{{N_1}g}},\sigma _{{N_1}g}^2)\), \(N_{2,ig} \sim LN({\mu _{{N_2}g}},\sigma _{{N_2}g}^2)\) and \({T_{ig}} \sim LN({\mu _{Tg}}, \sigma _{Tg}^2)\), where LN denotes a \(\log_2\)-normal distribution and \(\sigma _{{N_1}g}^2\), \(\sigma _{{N_2}g}^2\), \(\sigma _{Tg}^2\) are the variances of the \(\log_2\)-transformed data. Consequently, our model can be expressed as the convolution of the density functions of three \(\log_2\)-normal distributions. Because this convolution has no closed form, we use numerical integration to evaluate the complete likelihood function (see the full likelihood in the Supplementary Materials).
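As a minimal illustration of this convolution (a sketch, not the implementation used inside DeMixT), the density of the two-component case \(Y = \pi N + (1-\pi)T\) can be evaluated by numerically integrating out the unobserved \(N\); the function names and parameter values below are hypothetical.

# Sketch: density of Y = pi*N + (1-pi)*T with log2-normal N and T,
# obtained by numerical integration over the unobserved N.
dlog2norm <- function(x, mu, sigma) {
  # density of X when log2(X) ~ Normal(mu, sigma^2)
  dnorm(log2(x), mean = mu, sd = sigma) / (x * log(2))
}
dmix2 <- function(y, pi1, muN, sdN, muT, sdT) {
  integrand <- function(n) {
    t <- (y - pi1 * n) / (1 - pi1)  # implied raw tumor expression
    dlog2norm(n, muN, sdN) * dlog2norm(t, muT, sdT) / (1 - pi1)
  }
  integrate(integrand, lower = 1e-8, upper = y / pi1 - 1e-8)$value
}
# Hypothetical example: likelihood contribution of one observation
dmix2(y = 150, pi1 = 0.4, muN = 7, sdN = 0.5, muT = 7.5, sdT = 0.5)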

The DeMixT algorithm for deconvolution

DeMixT estimates all distribution parameters and cellular proportions and reconstitutes the expression profiles for all three components for each gene and each sample, as shown in equation (1). The estimation procedure (summarized in Figure 1b) has two main steps as follows.
1. Obtain a set of parameters \(\{\pi_{1,i}, \pi_{2,i}\}_{i=1}^S\), \(\{\mu_T, \sigma_T\}_{g=1}^G\) that maximizes the complete likelihood function, for which \(\{\mu_{N_{1,g}}, \sigma_{N_{1,g}}, \mu_{N_{2,g}}, \sigma_{N_{2,g}}\}_{g=1}^G\) have already been estimated from the available unmatched samples of the \(N_1\) and \(N_2\) component tissues. This step is described in further detail below under parameter estimation and the GSCM approach.
2. Reconstitute the expression profiles by searching, for each gene \(g\) and sample \(i\), for the pair \(\{n_{1,ig},n_{2,ig}\}\) that maximizes the joint density of \(N_{1,ig}\), \(N_{2,ig}\) and \(T_{ig}\):
\[\begin{equation} \mathop {\arg \max }\limits_{{n_{1,ig}},{n_{2,ig}}} \phi (\frac{{{y_{ig}} - {{\hat \pi }_{1,i}}{n_{1,ig}} - {{\hat \pi }_{2,i}}{n_{2,ig}}}}{{1 - {{\hat \pi }_{1,i}} - {{\hat \pi }_{2,i}}}}\left| {{{\hat \mu }_{{T_g}}},{{\hat \sigma }_{{T_g}}}} \right.)\phi ({n_{1,ig}}\left| {{{\hat \mu }_{{N_{1g}}}},{{\hat \sigma }_{{N_{1g}}}}} \right.)\phi ({n_{2,ig}}\left| {{{\hat \mu }_{{N_{2g}}}},{{\hat \sigma }_{{N_{2g}}}}} \right.) \end{equation}\]

where \(\phi(\cdot \mid \mu, \sigma)\) is the \(\log_2\)-normal density with location parameter \(\mu\) and scale parameter \(\sigma\).
In step 2, we combined the golden section search method with successive parabolic interpolations to find the maximum of the joint density function with respect to \(n_{1,ig}\) and \(n_{2,ig}\) that are positively bounded and constrained by \(\hat \pi_{1,i} n_{1,ig} + \hat\pi_{2,i} n_{2,ig} \le y_{ig}\). The value of \(t_{ig}\) is solved as \({y_{ig}} - {{\hat \pi }_{1,i}}{n_{1,ig}} - {{\hat \pi }_{2,i}}{n_{2,ig}}\).
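A sketch of this constrained search for a single gene and sample is given below; it uses a generic box-constrained optimizer over a reparameterized \((n_1, n_2)\) rather than the golden-section routine used inside DeMixT, and the function name and parameter values are hypothetical.

# Sketch of the step-2 reconstitution for a single gene and sample:
# maximize the joint log2-normal density over (n1, n2) subject to
# n1, n2 > 0 and pi1*n1 + pi2*n2 <= y. Fractions (a, b) in (0, 1)
# reparameterize the constraint; a generic box-constrained optimizer
# stands in for the golden-section search used inside the package.
dlog2norm <- function(x, mu, sigma) dnorm(log2(x), mu, sigma) / (x * log(2))
reconstitute_one <- function(y, pi1, pi2, muN1, sdN1, muN2, sdN2, muT, sdT) {
  negloglik <- function(par) {
    n1 <- par[1] * y / pi1                   # ensures pi1*n1 < y
    n2 <- par[2] * (y - pi1 * n1) / pi2      # ensures pi1*n1 + pi2*n2 < y
    t  <- (y - pi1 * n1 - pi2 * n2) / (1 - pi1 - pi2)
    d  <- c(dlog2norm(n1, muN1, sdN1), dlog2norm(n2, muN2, sdN2),
            dlog2norm(t, muT, sdT))
    -sum(log(pmax(d, 1e-300)))               # floor keeps the objective finite
  }
  fit <- optim(c(0.3, 0.3), negloglik, method = "L-BFGS-B",
               lower = c(1e-6, 1e-6), upper = c(1 - 1e-6, 1 - 1e-6))
  n1 <- fit$par[1] * y / pi1
  n2 <- fit$par[2] * (y - pi1 * n1) / pi2
  c(n1 = n1, n2 = n2, t = (y - pi1 * n1 - pi2 * n2) / (1 - pi1 - pi2))
}
# Hypothetical example
reconstitute_one(y = 200, pi1 = 0.3, pi2 = 0.2, muN1 = 7, sdN1 = 0.5,
                 muN2 = 7.2, sdN2 = 0.5, muT = 7.8, sdT = 0.5)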

Parameter estimation using iterated conditional modes (ICM)

In step 1, the unknown parameters to be estimated can be divided into two groups: gene-wise parameters, \(\{\mu_T,\sigma_T\}_{g=1}^G\), and sample-wise parameters, \(\{\pi_1,\pi_2\}_{i=1}^S\). Parameters within each group are conditionally independent given the parameters in the other group (Figure 1b).
For each pair of sample-wise parameters, we have
\({\{ {\pi _1},{\pi _2}\} _i} {\perp\!\!\!\perp} {\{ {\pi _1},{\pi _2}\} _j} \mid {{{\{ {\mu _T},{\sigma _T}\} }_1}, \cdots ,{{\{ {\mu _T},{\sigma _T}\} }_G}}\), for all \(i \neq j \in \{1, \cdots, S\}\),
and similarly for each pair of gene-wise parameters, we have
\({\{ \mu _T,\sigma _T\} _i} {\perp\!\!\!\perp} {\{ \mu _T,\sigma _T\} _j}\mid {{{\{ \pi _1,\pi _2\}}_1}, \cdots ,{{\{ \pi _1,\pi _2\} }_S} }\), for all \(i \neq j \in \{1, \cdots, G\}\).
These relationships allow us to implement an optimization method, ICM, to iteratively derive the conditional modes of each pair of gene-wise or sample-wise parameters, conditional on the others. Here, \({\pi _1},{\pi _2}\) are constrained between \(0\) and \(1\), and \(\mu_T, \sigma_T\) are positively bounded. We combined a golden section search and successive parabolic interpolations to find a good local maximum in each step. As shown by Besag, for ICM, the complete likelihood never decreases at any iteration and the convergence to the local maximum is guaranteed. Our ICM implementation is described in Figure 5.
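The sketch below illustrates the ICM iteration for the simpler two-component case (the package implements this step in compiled code); R's optimize(), which also combines golden-section search with successive parabolic interpolation, stands in for the one-dimensional conditional updates, and the \((\mu_T, \sigma_T)\) pair is updated coordinate-wise here for simplicity. The function name, bounds and helper are illustrative.

# Schematic ICM loop for a two-component model (illustration only).
# Y: genes-by-samples matrix of raw mixed expression; muN, sdN: known
# per-gene parameters of the N-component on the log2 scale.
icm_sketch <- function(Y, muN, sdN, pi0, mu0, sd0, n_iter = 5) {
  G <- nrow(Y); S <- ncol(Y)
  pi_t <- pi0; mu_t <- mu0; sd_t <- sd0
  dlog2norm <- function(x, mu, sigma) dnorm(log2(x), mu, sigma) / (x * log(2))
  # log-likelihood of a single observation under the two-component convolution
  ll_one <- function(y, p, muNg, sdNg, muTg, sdTg) {
    f <- function(n) dlog2norm(n, muNg, sdNg) *
      dlog2norm((y - p * n) / (1 - p), muTg, sdTg) / (1 - p)
    log(integrate(f, 1e-8, y / p - 1e-8)$value)
  }
  for (it in 1:n_iter) {
    for (i in 1:S) {   # sample-wise update: pi_i | {mu_Tg, sigma_Tg}
      pi_t[i] <- optimize(function(p) sum(sapply(1:G, function(g)
        ll_one(Y[g, i], p, muN[g], sdN[g], mu_t[g], sd_t[g]))),
        interval = c(0.01, 0.99), maximum = TRUE)$maximum
    }
    for (g in 1:G) {   # gene-wise updates: (mu_Tg, sigma_Tg) | {pi_i}
      mu_t[g] <- optimize(function(m) sum(sapply(1:S, function(i)
        ll_one(Y[g, i], pi_t[i], muN[g], sdN[g], m, sd_t[g]))),
        interval = c(1, 15), maximum = TRUE)$maximum
      sd_t[g] <- optimize(function(s) sum(sapply(1:S, function(i)
        ll_one(Y[g, i], pi_t[i], muN[g], sdN[g], mu_t[g], s))),
        interval = c(0.05, 3), maximum = TRUE)$maximum
    }
  }
  list(pi = pi_t, mu_T = mu_t, sigma_T = sd_t)
}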

The GSCM approach to improve model identifiability

Due to the high dimension of the parameter search space, and the often flat likelihood surfaces in certain regions of the true parameters (e.g., \(\mu_{N_1} = \mu_{N_2}\)) that will be encountered by ICM (Supplementary Figure 2), we have developed a GSCM approach (illustrated in Figure 1b) to focus on the hilly part of the likelihood space. This reduces the parameter search space and improves both accuracy and computational efficiency. Here, we describe our general strategy; a sketch of the corresponding gene selection is given after the two stages below. As there are large variations in the number of genes that are differentially expressed across datasets, the actual cutoffs may be adjusted for a given dataset.
Stage 1
We first combine the \(N_1\) and \(N_2\) components and assume a two-component mixture instead of three. This allows us to quickly estimate \(\pi_T\).
a: We select a gene set containing genes with small standard deviations (\(<0.1\) or \(0.5\)) for both the \(N_1\) and \(N_2\) components. Among these genes, we further select genes with \(\overline {LN}_{1g} \approx \overline {LN}_{2g}\) (mean difference \(< 0.25\) or \(0.5\)), where the \(\overline {LN}\) is the sample mean for the log2-transformed data. Within this set, we further select genes with the largest sample standard deviations of \(Y_g\) (top 250), suggesting differential expression between \(T\) and \(N\).
b: We run DeMixT in the two-component setting to estimate \(\mu_{Tg}\), \(\sigma_{Tg}^2\) and \(\pi_T\).
Stage 2
We then fix the values of \(\{\pi_T\}_i\) as derived from Stage 1, and further estimate \(\{\pi_1\}_i\) and \(\{\pi_2\}_i\) in the three-component setting.
a: We select genes with the greatest difference in the mean expression levels between the \(N_1\) and \(N_2\) components as well as those with the largest sample standard deviations of \(Y_g\) (top 250).
b: We run DeMixT in the three-component setting over the selected genes to estimate \(\pi_1\) and \(\pi_2\) given \(\pi_T\).
c: We estimate the gene-wise parameters for all genes given the fixed \(\pi\)’s. Finally, given all parameters, per gene per sample expression level, \(n_{1,ig}\), \(n_{2,ig}\) and \(t_{ig}\) are reconstituted.
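A hedged sketch of this gene selection is shown below; DeMixT performs the equivalent filtering internally through its if.filter / filter.criteria arguments (used later in this document). N1, N2 and Y are hypothetical genes-by-samples matrices on the log2 scale, and combining the two Stage-2 criteria by a rank sum is one simple choice, not necessarily the package's.

# Sketch of GSCM gene selection (illustration only; DeMixT's if.filter /
# filter.criteria arguments implement the filtering internally).
# N1, N2, Y: hypothetical genes-by-samples matrices on the log2 scale.
select_stage1_genes <- function(N1, N2, Y, sd_cut = 0.1, diff_cut = 0.25, n_top = 250) {
  sd_n1 <- apply(N1, 1, sd)                  # per-gene sd in N1 references
  sd_n2 <- apply(N2, 1, sd)                  # per-gene sd in N2 references
  mdiff <- abs(rowMeans(N1) - rowMeans(N2))  # |mean(LN1g) - mean(LN2g)|
  sd_y  <- apply(Y, 1, sd)                   # per-gene sd in mixed samples
  keep  <- which(sd_n1 < sd_cut & sd_n2 < sd_cut & mdiff < diff_cut)
  keep[order(sd_y[keep], decreasing = TRUE)][seq_len(min(n_top, length(keep)))]
}
select_stage2_genes <- function(N1, N2, Y, n_top = 250) {
  # rank genes by a simple rank-sum of |mean difference N1 vs N2| and sd of Y
  score <- rank(abs(rowMeans(N1) - rowMeans(N2))) + rank(apply(Y, 1, sd))
  order(score, decreasing = TRUE)[seq_len(min(n_top, nrow(Y)))]
}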

Simulation study for the GSCM approach

To demonstrate the utility of GSCM for parameter estimation, we simulated a dataset with expression levels from 500 genes and 90 samples: 20 of pure \(N_1\)-type, 20 of pure \(N_2\)-type and 50 mixed samples. For the 50 mixed samples, we generated their proportions for all three components \((\pi_1, \pi_2, \pi_T) \sim Dir(1,1,1)\), where \(Dir\) is a Dirichlet distribution. For each mixed sample, we simulated expression levels of \(500\) genes for the \(N_1\)- and T-components from a \(\log_2\)-normal distribution with \(\mu_{N_{1g}}\) and \(\mu_{Tg}\) drawn from \(N_{[0, +\infty]}(7, 1.5^2)\), and with equal variance. For the \(N_2\)-component, we generated \(\mu_{N_{2g}}\) as \(\mu_{N_{1g}} + d_g\), where \(d_g \sim N_{[-0.1,0.1]}(0,1.5^2)\) for \(475\) genes (\(\hat \mu_{N1g} \approx \hat \mu_{N2g}\)) and \(d_g \sim N_{[0.1,3]}(0,1.5^2) \cup N_{[-3,-0.1]}(0,1.5^2)\) for \(25\) genes (\(\hat \mu_{N1g} \not\approx \hat \mu_{N2g}\)). Then we mixed the \(N_1\), \(N_2\) and T-component expression levels linearly at the generated proportions according to our convolution model. At each simulation we created a full matrix consisting of \(20\) \(N_1\)-type reference samples (generated separately from the \(N_1\) distribution), \(20\) \(N_2\)-type reference samples (generated separately from the \(N_2\) distribution) and \(50\) mixed samples, and repeated the simulation \(100\) times for each of the three variance values \(\sigma \in \{0.1,0.3,0.5\}\), giving \(300\) simulated datasets in total. We first ran DeMixT with GSCM, where we used the 475 genes with simulated \(\hat \mu_{N1g} \approx \hat \mu_{N2g}\) to run the two-component deconvolution (\(N\) versus \(T\)) and used the remaining \(25\) genes to run the three-component deconvolution with the estimated \(\hat{\pi}_T\). We also ran DeMixT without GSCM using all 500 genes. RMSEs between the estimated and true proportions for all mixed samples were calculated for each of the two runs: with and without GSCM.

Library loading

library(MCMCpack)   # rdirichlet for mixing proportions
library(truncnorm)  # rtruncnorm for truncated normal draws
library(hydroGOF)   # rmse
library(ggplot2)
library(DeMixT)     # DeMixT.Kernel / DeMixT.S1 used below

Data simulation

for (i in 1:3){
set.seed(i) # set the seed once per sigma setting so that the 100 repeats differ
for(j in 1:100){
ngenes1=475; ngenes2=25; ngenes = ngenes1 + ngenes2
nT=50 # three components
nN1=20; nN2=20
nData<-matrix(0, ngenes, nN1+nN2+nT);
TData<-matrix(0, ngenes, nT);  ## True T
TPi <- t(rdirichlet(nT, alpha = c(1,1,1)))
row.names(TPi) <- 1:3
Nmu1=7; Nsigma1=1.5 # stroma 1
Tmu=7; Tsigma=1.5 # tumor
SW<-c(0.1,0.3,0.5)[i]
SWN1<-SW; SWN2<-SW; SWT<-SW; 
mun1<-rtruncnorm(ngenes, a=0,mean = Nmu1, sd=Nsigma1)
# per-gene shift d_g within [-0.1, 0.1] for the first 475 genes (mu_N1g ~ mu_N2g)
mun2_1st <- mun1[1:ngenes1] + rtruncnorm(ngenes1, a = -0.1, b = 0.1, mean = 0, sd = 1.5)
mun2_2nd <- c()
for(l in (ngenes1+1):(ngenes1+ngenes2)){
     tmp <- mun1[l] + rtruncnorm(1, a = 0.1, b = 3, mean = 0, sd = 1.5)*(-1)^rbinom(1,size=1,prob=0.5)
     while(tmp <= 0) tmp <- mun1[l] + rtruncnorm(1, a = 0.1, b = 3, mean = 0, sd = 1.5)*(-1)^rbinom(1,size=1,prob=0.5)
     mun2_2nd <- c(mun2_2nd, tmp)
}
mun2 <- c(mun2_1st, mun2_2nd)
mut<-rtruncnorm(ngenes, a=0,mean = Tmu, sd=Tsigma)

mu <- cbind(mun1, mun2, mut)

for(k in 1:ngenes)
{
   nData[k,1:nN1] = 2^(rtruncnorm(nN1, a=0,mean=mun1[k], sd=SWN1)); #stroma 1
   nData[k,(nN1+1):(nN1+nN2)] = 2^(rtruncnorm(nN2, a=0,mean=mun2[k], sd=SWN2)); #stroma 2
   TData[k,] =  2^rtruncnorm(nT, a=0,mean=mut[k], sd=SWT);
   nData[k,(nN1+nN2+1):(nN1+nN2+nT)] = TPi[1,]*2^(rtruncnorm(nT, a=0,mean=mun1[k], sd=SWN1)) + TPi[2,]*2^(rtruncnorm(nT, a=0,mean=mun2[k], sd=SWN2)) + TPi[3,]*TData[k,]
}
ngroup = c(rep(1, nN1), rep(2, nN2), rep(3, nT))
ngroup1 = c(rep(1, nN1+nN2), rep(3,nT))
save.image(file = paste0('sim',i,j,'.rda'))
}
}

Running DeMixT with/without GSCM

##with GSCM
for(ii in 1:3){
  for(jj in 1:100){
load(paste0('sim',ii,jj,'.rda'))
nData1 <- nData[1:ngenes1, ]
nData2<- nData[(ngenes1+1):(ngenes1+ngenes2), ]
xres1 <- DeMixT.Kernel(inputdata=nData1, groupid=ngroup1, nhavepi=0, givenpi=rep(0,nT), givenpiT=rep(0,nT))
PiT = 1 - colSums(xres1$pi)
xres2 <- DeMixT.Kernel(inputdata=nData2, groupid=ngroup, nhavepi=2, givenpi=rep(0,nT*2), givenpiT=PiT, niter=30)
save(xres1, xres2, file=paste0('x',ii,jj,'.rda'))
##without GSCM
yres <- DeMixT.S1(nData, ngroup, niter = 30, if.filter= FALSE)
save(yres, file=paste0('y',ii,jj,'.rda'))
  }
}

Supplementary Figure 1

for(iii in 1:3){
rx1 = c(); ry1 = c(); 
rx2 = c(); ry2 = c();  
rx3 = c(); ry3 = c();
for(jjj in 1:100){
  load(paste0('sim',iii,jjj,'.rda'))
  load(paste0('x',iii,jjj,'.rda'))
  load(paste0('y',iii,jjj,'.rda'))
  rx1 = c(rx1, rmse(xres2$pi[1,], TPi[1,]))
  ry1 = c(ry1, rmse(yres$pi[1,], TPi[1,]))
  rx2 = c(rx2, rmse(xres2$pi[2,], TPi[2,]))
  ry2 = c(ry2, rmse(yres$pi[2,], TPi[2,]))
  rx3 = c(rx3, rmse(1 - colSums(xres2$pi), TPi[3,]))
  ry3 = c(ry3, rmse(1 - colSums(yres$pi), TPi[3,]))
}
save.image(file = "rmse",iii,".rda")
}
sig = c("0.1","0.3","0.5")
for(iv in 1:3){
plot.new()
load(paste0("rmse",iv,".rda"))
rx = c(rx1, rx2, rx3, ry1, ry2, ry3)
lab = c(rep(1,100),rep(2,100),rep(3,100),rep(1,100),rep(2,100),rep(3,100))
clr = c(rep("red",300), rep("blue",300))
dat = cbind(rx,lab,clr)
dat = as.data.frame(dat)
dat$rx = as.numeric(as.character(dat$rx))
dat$lab = as.factor(dat$lab)
dat$clr = as.factor(dat$clr)
g1<-ggplot(dat, aes(lab, rx, fill=clr)) +
  geom_dotplot(binaxis='y', stackdir='center',dotsize = 1,
               position=position_dodge(0.25)) + xlab("") +
  geom_segment(x = 0.7, xend = 1.05, y = median(ry1), yend = median(ry1), linetype="dashed", color = "red") + 
  geom_segment(x = 0.95, xend = 1.3, y = median(rx1), yend = median(rx1), linetype="dashed", color = "red") + 
  geom_segment(x = 1.7, xend = 2.05, y = median(ry2), yend = median(ry2), linetype="dashed", color = "red") + 
  geom_segment(x = 1.95, xend = 2.3, y = median(rx2), yend = median(rx2), linetype="dashed", color = "red") + 
  geom_segment(x = 2.7, xend = 3.05, y = median(ry3), yend = median(ry3), linetype="dashed", color = "red") + 
  geom_segment(x = 2.95, xend = 3.3, y = median(rx3), yend = median(rx3), linetype="dashed", color = "red") +
  ylab("RMSE") +
guides(fill=FALSE)+
scale_fill_manual(values=c("#17B6F6","#F617DB"))+
annotate(geom = "text", x = 1:3,label = c(paste("pi[1]"),paste("pi[2]"),paste("pi[T]")), y = -0.015, size = 12,parse=T) +
annotate(geom = "text", x = c(0.8,1.2,1.8,2.2,2.8,3.2), label = c("w/o CM","w/ CM","w/o CM","w/ CM","w/o CM","w/ CM"), y = -0.035, size = 10) +
coord_cartesian(ylim = c(0.0, 0.5), expand = FALSE) +
ggtitle(bquote(sigma[N[1]] ~" = " ~sigma[N[2]] ~" = " ~sigma[T] ~ " = " ~ .(sig[iv])))+
  theme(
      plot.margin = unit(c(1, 1, 4, 2), "lines"),
      panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
      panel.background = element_blank(), axis.line = element_line(colour = "black"),
      axis.title.y = element_text(face='bold',size=27,vjust=1),
      axis.text.y = element_text(face='bold',size=22,color='black'),
      axis.title.x = element_blank(),
      axis.text.x = element_blank(),
      plot.title = element_text(hjust = 0.5, size =32),
      )
g2 <- ggplot_gtable(ggplot_build(g1))
g2$layout$clip[g2$layout$name == "panel"] <- "off"
grid::grid.draw(g2)
}

Data analysis

All analyses were performed using the open-source environment R (http://cran.r-project.org). Documentation (knitr-html) of all scripts is provided at the GitHub repository.

Installing DeMixT

library(devtools)
devtools::install_github("wwylab/DeMixT/DeMixT_0.1")

Summary statistics for performance evaluation

Concordance correlation coefficient (CCC)

To evaluate the performance of our method, we use the CCC and the RMSE. The CCC \(\rho_{xy}\) is a measure of agreement between two variables, \(x\) and \(y\), and is defined as \(\rho_{xy} = \frac{2\rho \sigma_x \sigma_y}{\sigma^2_x + \sigma^2_y + (\mu_x - \mu_y)^2}\), where \(\mu\) and \(\sigma^2\) are the corresponding mean and variance of each variable, and \(\rho\) is the correlation coefficient between the two variables. We calculate the CCC between the estimated and true proportions to evaluate proportion estimation. We also calculate the CCC between the deconvolved and observed expression values (\(\log_2\)-transformed).
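A direct implementation of the formula above might look like the following; the epi.ccc() function from the epiR package, used later in this document, gives essentially the same estimate up to how the moments are estimated. The example vectors are hypothetical.

# Concordance correlation coefficient computed directly from the formula above.
ccc <- function(x, y) {
  2 * cor(x, y) * sd(x) * sd(y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
}
# Hypothetical example: true vs. estimated proportions
set.seed(1)
truth <- runif(30, 0.1, 0.9)
estimate <- truth + rnorm(30, sd = 0.05)
ccc(truth, estimate)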

Measure of reproducibility

To assess the reproducibility of the estimated \(\pi\) across scenarios when the different components are unknown (i.e., three scenarios for a three-component model with one unknown component), we define a statistic \(R = \frac{1}{S} \sum_{i}^{S} (\frac{1}{K-1} \sum_{k}^{K} (\epsilon_i^k - \frac{1}{K} \sum_{k}^{K} \epsilon_i^k)^2)^{\frac{1}{2}}\), where \(\epsilon_i^k = \hat \pi_i^k - \pi_i\), \(\hat \pi_i^k\) is the estimated value for the \(k\)-th scenario and \(\pi_i\) is the truth for sample \(i\). \(S\) denotes the sample size and \(K\) is the number of scenarios. This measures the variations in the estimation errors across different scenarios. We consider a method with a smaller \(R\) as more reproducible and therefore more desirable.
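For concreteness, a minimal sketch of this statistic (with a hypothetical \(S \times K\) matrix of errors) is given below; the same quantity is computed at the end of the mixed tissue results as the mean over samples of the per-sample standard deviation of the errors.

# Reproducibility statistic R: mean over samples of the standard deviation,
# across scenarios, of the estimation errors epsilon_i^k = pi_hat_i^k - pi_i.
# 'errors' is a hypothetical S x K matrix.
reproducibility_R <- function(errors) {
  mean(apply(errors, 1, sd))
}
# Hypothetical example with S = 30 samples and K = 3 scenarios
set.seed(2)
errors <- matrix(rnorm(30 * 3, sd = 0.05), nrow = 30)
reproducibility_R(errors)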

Mixed tissue microarray dataset

We downloaded dataset GSE19830 from the GEO browser. We used the R package affy to summarize the raw probe intensities with quantile normalization but without background correction, as recommended in previous studies. We evaluated the performance of DeMixT with regard to tissue proportions and deconvolved expression levels on the set of genes that were selected based on the GSCM approach. Specifically, we selected genes with sample standard deviation \(<0.1\) in the \(N_1\) and \(N_2\) components, among which we used those with \(|\overline {LN}_{1g} - \overline {LN}_{2g}|<0.25\) for running the 2-component model, and used the top 250 genes with the largest \(|\overline {LN}_{1g} - \overline {LN}_{2g}|\) and the largest sample standard deviation in \(Y\) for running the 3-component model. Then we ran ISOpure for the purpose of comparison.

Library loading

library(limma)
library(preprocessCore)
library(affy)
library(DeMixT)
library("hydroGOF")
library('epiR')
library('memisc')
options(digits=3)

Data generation and processing

## read in microarray data and perform RMA-style preprocessing (quantile normalization, no background correction)
data <- ReadAffy()
eset <- expresso(data, normalize.method = "quantiles", bgcorrect.method = "none", pmcorrect.method ="pmonly",summary.method="medianpolish")
write.exprs(eset,file="GSE19830_original.txt")

Supplementary Table 7

rm(list=ls(all=TRUE))
gse = read.table("GSE19830_original.txt", header = F)
head = read.table("GSE19830_title.txt", sep = "\t", header = F)
head = as.character(t(as.matrix(head)))
head = strsplit(head, " / ")
note = matrix(unlist(head), nrow = 42, byrow = T)
mat = apply(note, 2, strsplit, split = " % ")
m1 = matrix(unlist(mat[[1]]), nrow = 42, byrow = T)[,1]
m2 = matrix(unlist(mat[[2]]), nrow = 42, byrow = T)[,1]
m3 = matrix(unlist(mat[[3]]), nrow = 42, byrow = T)[,1]
purity = cbind(m1,m2,m3)
colnames(purity) = c("Liver", "Brain", "Lung")
purity = as.data.frame(purity)
# Pure tissue 
liver = which(purity$Liver == "100"); brain = which(purity$Brain == "100"); lung = which(purity$Lung == "100")
save.image(file = 'gse19830.rda')
gsepur = gse[,c(liver,brain,lung)]
gseli = gse[,liver]; gsebr = gse[,brain]; gselu = gse[,lung]
barli = apply(gseli, 1, mean); barbr = apply(gsebr, 1, mean); barlu = apply(gselu, 1, mean)
conli0 = (barli > 7); conbr0 = (barbr > 7); conlu0 = (barlu > 7)
condli = conlu0&conbr0; condbr = conli0&conlu0; condlu = conli0&conbr0
condli_new = condli; condbr_new = condbr; condlu_new = condlu
save(condli_new, file = 'condli_new.RData')
save(condbr_new, file = 'condbr_new.RData')
save(condlu_new, file = 'condlu_new.RData')
# Mixed tissue
pur = purity[-c(liver,brain,lung),]
gsemix = gse[,-c(liver,brain,lung)]
gsemix = gsemix[,1:30]
pur = pur[-c(31,32,33),]; pur = apply(pur, 2, as.numeric); pur = pur/100
#three cases for deconvolution
#1 lung and brain known, liver unknown
lubr_li <- cbind(gselu, gsebr, gsemix)
write.table(lubr_li, "lubr_lid.txt", sep='\t',row.names = FALSE, col.names=FALSE);
write.table(gsepur[,1:3], "purTrueli.txt", sep='\t',row.names = FALSE, col.names=FALSE);
#2 lung and liver known, brain unknown
lilu_br <- cbind(gseli, gselu, gsemix)
write.table(lilu_br, "lilu_brd.txt", sep='\t',row.names = FALSE, col.names=FALSE);
write.table(gsepur[,4:6], "purTruebr.txt", sep='\t',row.names = FALSE, col.names=FALSE);
#3 liver and brain known, lung unknown
libr_lu <- cbind(gseli, gsebr, gsemix)  
write.table(libr_lu, "libr_lud.txt", sep='\t',row.names = FALSE, col.names=FALSE);
write.table(gsepur[,7:9], "purTruelu.txt", sep='\t',row.names = FALSE, col.names=FALSE);
load('gse19830.rda')
li = apply(gseli,1,mean); br = apply(gsebr,1,mean);lu = apply(gselu,1,mean)
cond1 = (li/br > 1.1)|(li/br < 1/1.1) 
cond2 = (li/lu > 1.1)|(li/lu < 1/1.1) 
cond3 = (lu/br > 1.1)|(lu/br < 1/1.1)
#1,2,3 for li,br,lu
#1 mu1 = mu2 = mu3
invisible(sum((!cond1)&(!cond2)&(!cond3))/31099) ## [1] 0.611
#2 mu1 != mu2 = mu3
invisible(sum((cond1)&(cond2)&(!cond3))/31099) ## [1] 0.0782
#3 mu1 = mu2 != mu3
invisible(sum((!cond1)&(cond2)&(cond3))/31099) ## [1] 0.0627
#4 mu1 != mu2 != mu3
invisible(sum((cond1)&(cond2)&(cond3))/31099) ## [1] 0.0405
invisible(sum((cond1)&(!cond2)&(!cond3))/31099) ## [1] 0.0397
invisible(sum((!cond1)&(cond2)&(!cond3))/31099) ## [1] 0.0269
invisible(sum((!cond1)&(!cond2)&(cond3))/31099) ## [1] 0.0352

Running DeMixT

run_id <- c('lubr_lid', 'lilu_brd', 'libr_lud')
for(i in 1:3){
    name <- run_id[i]
    inputmat <-as.matrix(read.table(paste(name, ".txt", sep=""), header=FALSE ))
    inputn1m <- apply(inputmat[,1:3], 1, mean)
    inputn2m <- apply(inputmat[,4:6], 1, mean)
    #keep genes with high expression for deconvolution after linearity checking
    inputmat1 <- 2^inputmat[((inputn1m > 7)&(inputn2m > 7)), ]
    groupid <-c(rep(1, 3), rep(2, 3), rep(3, 30)); # input as 2-component
    testr1 <- DeMixT.S1(inputmat1, groupid, niter = 15, ninteg = 60, filter.option = 1, filter.criteria1 = c(0.1,0.1), filter.criteria2 = c(250,200), filter.criteria3 = 0.25, if.filter = TRUE, sg0 = 0.01, mu0 = 7.0,tol = 10^-6)
    write.table(testr1$pi, paste(name,"_out2.txt",sep=""), col.names=FALSE, row.names=FALSE, sep="\t")
    givenpi <- c(t(testr1$pi))
    testr2 <- DeMixT.S2(inputmat, groupid, givenpi, ninteg = 60)
    write.table(testr2$decovExprT, paste(name,"_out_decov.txt",sep=""), col.names=FALSE, row.names=FALSE, sep="\t")
    write.table(testr2$decovMu[,3], paste(name,"_mu.txt",sep=""), col.names=FALSE, row.names=FALSE, sep="\t")
    write.table(testr2$decovSigma[,3], paste(name,"_sigma.txt",sep=""), col.names=FALSE, row.names=FALSE, sep="\t")
}

Results summary

Supplementary Tables 3, 4 and Table 1

load("pur.Rdata")
pur =as.matrix(pur); pur = apply(pur, 2, as.numeric); pur = pur/100
Tbr <- pur[,2]; Tli <- pur[,1]; Tlu <- pur[,3]
## brain unknown
lilu_brp <- read.table("lilu_brd_out2.txt", header = F)
br_li <- as.numeric(lilu_brp[1,]); br_lu <- as.numeric(lilu_brp[2,])
br_br <- 1 - br_li - br_lu
corbr <- epi.ccc(Tbr,br_br, ci = "z-transform", conf.level = 0.95)$rho.c$est
corli <- epi.ccc(Tli,br_li, ci = "z-transform", conf.level = 0.95)$rho.c$est
corlu <- epi.ccc(Tlu,br_lu, ci = "z-transform", conf.level = 0.95)$rho.c$est
print(epi.ccc(Tbr,br_br, ci = "z-transform", conf.level = 0.95)$rho.c)
##     est lower upper
## 1 0.879 0.796  0.93
print(epi.ccc(Tlu,br_lu, ci = "z-transform", conf.level = 0.95)$rho.c)
##     est lower upper
## 1 0.951 0.914 0.972
print(epi.ccc(Tli,br_li, ci = "z-transform", conf.level = 0.95)$rho.c)
##    est lower upper
## 1 0.74 0.615 0.829
print(round(corbr,2)); print(round(corlu,2)); print(round(corli,2))
## [1] 0.88
## [1] 0.95
## [1] 0.74
iso_br <- as.numeric(as.matrix(read.table("lilubr_alphapurities_new.txt", header = F)))
iso_li <- apply(as.matrix(read.table("lilubr_theta_new.txt", header = F))[,1:3], 1, sum)
iso_lu <- apply(as.matrix(read.table("lilubr_theta_new.txt", header = F))[,4:6], 1, sum)
# load results from ISOpure
iso_corbr <- epi.ccc(Tbr,iso_br, ci = "z-transform", conf.level = 0.95)$rho.c$est
iso_corlu <- epi.ccc(Tlu,iso_lu, ci = "z-transform", conf.level = 0.95)$rho.c$est
iso_corli <- epi.ccc(Tli,iso_li, ci = "z-transform", conf.level = 0.95)$rho.c$est
print(epi.ccc(Tbr,iso_br, ci = "z-transform", conf.level = 0.95)$rho.c)
##     est lower upper
## 1 0.681 0.544 0.783
print(epi.ccc(Tlu,iso_lu, ci = "z-transform", conf.level = 0.95)$rho.c)
##     est lower upper
## 1 0.996 0.993 0.998
print(epi.ccc(Tli,iso_li, ci = "z-transform", conf.level = 0.95)$rho.c)
##   est lower upper
## 1 0.7 0.557 0.803
print(round(iso_corbr,2)); print(round(iso_corlu,2)); print(round(iso_corli,2))
## [1] 0.68
## [1] 1
## [1] 0.7
print(round(rmse(Tbr,br_br),2)); print(round(rmse(Tlu,br_lu),2)); print(round(rmse(Tli,br_li),2))
## [1] 0.08
## [1] 0.06
## [1] 0.13
print(round(rmse(Tbr,iso_br),2)); print(round(rmse(Tlu,iso_lu),2)); print(round(rmse(Tli,iso_li),2))
## [1] 0.18
## [1] 0.02
## [1] 0.17
#calculate error for each value
#DeMixT
b.br <- br_br - Tbr; b.lu <- br_lu - Tlu; b.li <- br_li - Tli
#ISOpure
ib.br <- iso_br - Tbr; ib.lu <- iso_lu - Tlu; ib.li <- iso_li - Tli
## lung unknown
libr_lup <- read.table("libr_lud_out2.txt", header = F)
lu_li <- as.numeric(libr_lup[1,]); lu_br <- as.numeric(libr_lup[2,])
lu_lu <- 1 - lu_li - lu_br
corbr <- epi.ccc(Tbr,lu_br, ci = "z-transform", conf.level = 0.95)$rho.c$est
corli <- epi.ccc(Tli,lu_li, ci = "z-transform", conf.level = 0.95)$rho.c$est
corlu <- epi.ccc(Tlu,lu_lu, ci = "z-transform", conf.level = 0.95)$rho.c$est
print(epi.ccc(Tbr,lu_br, ci = "z-transform", conf.level = 0.95)$rho.c)
##     est lower upper
## 1 0.838 0.715  0.91
print(epi.ccc(Tlu,lu_lu, ci = "z-transform", conf.level = 0.95)$rho.c)
##     est lower upper
## 1 0.971 0.949 0.983
print(epi.ccc(Tli,lu_li, ci = "z-transform", conf.level = 0.95)$rho.c)
##     est lower upper
## 1 0.753 0.628  0.84
print(round(corbr,2)); print(round(corlu,2)); print(round(corli,2))
## [1] 0.84
## [1] 0.97
## [1] 0.75
iso_lu <- as.numeric(as.matrix(read.table("librlu_alphapurities_new.txt", header = F)))
iso_li <- apply(as.matrix(read.table("librlu_theta_new.txt", header = F))[,1:3], 1, sum)
iso_br <- apply(as.matrix(read.table("librlu_theta_new.txt", header = F))[,4:6], 1, sum)
# load results from ISOpure
iso_corbr <- epi.ccc(Tbr,iso_br, ci = "z-transform", conf.level = 0.95)$rho.c$est
iso_corlu <- epi.ccc(Tlu,iso_lu, ci = "z-transform", conf.level = 0.95)$rho.c$est
iso_corli <- epi.ccc(Tli,iso_li, ci = "z-transform", conf.level = 0.95)$rho.c$est
print(epi.ccc(Tbr,iso_br, ci = "z-transform", conf.level = 0.95)$rho.c)
##     est lower upper
## 1 0.971 0.941 0.986
print(epi.ccc(Tlu,iso_lu, ci = "z-transform", conf.level = 0.95)$rho.c)
##     est lower upper
## 1 0.736 0.611 0.825
print(epi.ccc(Tli,iso_li, ci = "z-transform", conf.level = 0.95)$rho.c)
##     est lower upper
## 1 0.838 0.748 0.898
print(round(iso_corbr,2)); print(round(iso_corlu,2)); print(round(iso_corli,2))
## [1] 0.97
## [1] 0.74
## [1] 0.84
print(round(rmse(Tbr,lu_br),2)); print(round(rmse(Tlu,lu_lu),2)); print(round(rmse(Tli,lu_li),2))
## [1] 0.1
## [1] 0.05
## [1] 0.13
print(round(rmse(Tbr,iso_br),2)); print(round(rmse(Tlu,iso_lu),2)); print(round(rmse(Tli,iso_li),2))
## [1] 0.04
## [1] 0.14
## [1] 0.11
# calculate error for each value
# DeMixT
u.br <- lu_br - Tbr; u.lu <- lu_lu - Tlu; u.li <- lu_li - Tli
# ISOpure
iu.br <- iso_br - Tbr; iu.lu <- iso_lu - Tlu; iu.li <- iso_li - Tli
## liver unknown
lubr_lip <- read.table("lubr_lid_out2.txt", header = F)
li_lu <- as.numeric(lubr_lip[1,]); li_br <- as.numeric(lubr_lip[2,])
li_li <- 1 - li_lu - li_br
corbr <- epi.ccc(Tbr,li_br, ci = "z-transform", conf.level = 0.95)$rho.c$est
corli <- epi.ccc(Tli,li_li, ci = "z-transform", conf.level = 0.95)$rho.c$est
corlu <- epi.ccc(Tlu,li_lu, ci = "z-transform", conf.level = 0.95)$rho.c$est
print(epi.ccc(Tbr,li_br, ci = "z-transform", conf.level = 0.95)$rho.c)
##     est lower upper
## 1 0.775 0.654 0.857
print(epi.ccc(Tlu,li_lu, ci = "z-transform", conf.level = 0.95)$rho.c)
##    est lower upper
## 1 0.96 0.941 0.974
print(epi.ccc(Tli,li_li, ci = "z-transform", conf.level = 0.95)$rho.c)
##     est lower upper
## 1 0.744  0.62 0.831
print(round(corbr,2)); print(round(corlu,2)); print(round(corli,2))
## [1] 0.77
## [1] 0.96
## [1] 0.74
# load results from ISOpure
iso_li <- as.numeric(as.matrix(read.table("lubrli_alphapurities_new.txt", header = F)))
iso_lu <- apply(as.matrix(read.table("lubrli_theta_new.txt", header = F))[,1:3], 1, sum)
iso_br <- apply(as.matrix(read.table("lubrli_theta_new.txt", header = F))[,4:6], 1, sum)
iso_corbr <- epi.ccc(Tbr,iso_br, ci = "z-transform", conf.level = 0.95)$rho.c$est
iso_corlu <- epi.ccc(Tlu,iso_lu, ci = "z-transform", conf.level = 0.95)$rho.c$est
iso_corli <- epi.ccc(Tli,iso_li, ci = "z-transform", conf.level = 0.95)$rho.c$est
print(epi.ccc(Tbr,iso_br, ci = "z-transform", conf.level = 0.95)$rho.c)
##     est lower upper
## 1 0.933 0.884 0.962
print(epi.ccc(Tlu,iso_lu, ci = "z-transform", conf.level = 0.95)$rho.c)
##     est lower upper
## 1 0.981 0.961 0.991
print(epi.ccc(Tli,iso_li, ci = "z-transform", conf.level = 0.95)$rho.c)
##    est lower upper
## 1 0.98 0.964 0.989
print(round(iso_corbr,2)); print(round(iso_corlu,2)); print(round(iso_corli,2))
## [1] 0.93
## [1] 0.98
## [1] 0.98
print(round(rmse(Tbr,li_br),2)); print(round(rmse(Tlu,li_lu),2)); print(round(rmse(Tli,li_li),2))
## [1] 0.12
## [1] 0.05
## [1] 0.13
print(round(rmse(Tbr,iso_br),2)); print(round(rmse(Tlu,iso_lu),2)); print(round(rmse(Tli,iso_li),2))
## [1] 0.07
## [1] 0.04
## [1] 0.04
#calculate error for each value
# DeMixT
i.br <- li_br - Tbr; i.lu <- li_lu - Tlu; i.li <- li_li - Tli
# ISOpure
ii.br <- iso_br - Tbr; ii.lu <- iso_lu - Tlu; ii.li <- iso_li - Tli
## reproducibility summary: mean over samples of the sd of the errors across the three unknown-component scenarios
#DeMixT
dm.br <- cbind(b.br, u.br, i.br); dm.lu <- cbind(b.lu, u.lu, i.lu); dm.li <- cbind(b.li, u.li, i.li)
dm.brs <- apply(dm.br, 1, sd); dm.brs <- mean(dm.brs)
dm.lus <- apply(dm.lu, 1, sd); dm.lus <- mean(dm.lus)
dm.lis <- apply(dm.li, 1, sd); dm.lis <- mean(dm.lis)
print(c(dm.brs, dm.lus, dm.lis))
## [1] 0.0327 0.0324 0.0253
#ISOpure
iso.br <- cbind(ib.br, iu.br, ii.br); iso.lu <- cbind(ib.lu, iu.lu, ii.lu); iso.li <- cbind(ib.li, iu.li, ii.li)
iso.brs <- apply(iso.br, 1, sd); iso.brs <- mean(iso.brs)
iso.lus <- apply(iso.lu, 1, sd); iso.lus <- mean(iso.lus)
iso.lis <- apply(iso.li, 1, sd); iso.lis <- mean(iso.lis)
print(c(iso.brs, iso.lus, iso.lis))
## [1] 0.0961 0.0830 0.0664

Supplementary Figure 3

## lilu_br
lilu_brp <- read.table("lilu_brd_out2.txt", header = F)
br_li <- as.numeric(lilu_brp[1,]); br_lu <- as.numeric(lilu_brp[2,])
br_br <- 1 - br_li - br_lu
isobr_br <- as.numeric(as.matrix(read.table("lilubr_alphapurities_new.txt", header = F)))
isobr_li <- apply(as.matrix(read.table("lilubr_theta_new.txt", header = F))[,1:3], 1, sum)
isobr_lu <- apply(as.matrix(read.table("lilubr_theta_new.txt", header = F))[,4:6], 1, sum)
## libr_lu
libr_lup <- read.table("libr_lud_out2.txt", header = F)
lu_li <- as.numeric(libr_lup[1,]); lu_br <- as.numeric(libr_lup[2,])
lu_lu <- 1 - lu_li - lu_br
isolu_lu <- as.numeric(as.matrix(read.table("librlu_alphapurities_new.txt", header = F)))
isolu_li <- apply(as.matrix(read.table("librlu_theta_new.txt", header = F))[,1:3], 1, sum)
isolu_br <- apply(as.matrix(read.table("librlu_theta_new.txt", header = F))[,4:6], 1, sum)
## lubr_li
lubr_lip <- read.table("lubr_lid_out2.txt", header = F)
li_lu <- as.numeric(lubr_lip[1,]); li_br <- as.numeric(lubr_lip[2,])
li_li <- 1 - li_lu - li_br
isoli_li <- as.numeric(as.matrix(read.table("lubrli_alphapurities_new.txt", header = F)))
isoli_lu <- apply(as.matrix(read.table("lubrli_theta_new.txt", header = F))[,1:3], 1, sum)
isoli_br <- apply(as.matrix(read.table("lubrli_theta_new.txt", header = F))[,4:6], 1, sum)
########liver########
li_prop1 <- c(pur[,1], pur[,1], pur[,1], pur[,1], pur[,1], pur[,1])
li_prop2 <- c(li_li, lu_li, br_li, isoli_li, isolu_li, isobr_li)
li_prop <- data.frame(Truth = li_prop1, Estimates = li_prop2, 
Tool = c(rep("DeMixT(Liver Unknown)", 30), rep("DeMixT(Lung Unknown)", 30),rep("DeMixT(Brain Unknown)", 30),
rep("ISOpure(Liver Unknown)", 30),rep("ISOpure(Lung Unknown)", 30), rep("ISOpure(Brain Unknown)", 30)),
Tissue = c(rep("Liver Unknown", 30), rep("Lung Unknown", 30),rep("Brain Unknown", 30), 
rep("Liver Unknown", 30),rep("Lung Unknown", 30), rep("Brain Unknown", 30)),
Dec = c(rep("DeMixT", 90), rep("ISOpure", 90)))
########brain########
br_prop1 <- c(pur[,2], pur[,2], pur[,2], pur[,2], pur[,2], pur[,2])
br_prop2 <- c(li_br, lu_br, br_br, isoli_br, isolu_br, isobr_br)
br_prop <- data.frame(Truth = br_prop1, Estimates = br_prop2, 
Tool = c(rep("DeMixT(Liver Unknown)", 30), rep("DeMixT(Lung Unknown)", 30),rep("DeMixT(Brain Unknown)", 30),
rep("ISOpure(Liver Unknown)", 30),rep("ISOpure(Lung Unknown)", 30), rep("ISOpure(Brain Unknown)", 30)),
Tissue = c(rep("Liver Unknown", 30), rep("Lung Unknown", 30),rep("Brain Unknown", 30), 
rep("Liver Unknown", 30),rep("Lung Unknown", 30), rep("Brain Unknown", 30)),
Dec = c(rep("DeMixT", 90), rep("ISOpure", 90)))
########lung########
lu_prop1 <- c(pur[,3], pur[,3], pur[,3], pur[,3], pur[,3], pur[,3])
lu_prop2 <- c(li_lu, lu_lu, br_lu, isoli_lu, isolu_lu, isobr_lu)
lu_prop <- data.frame(Truth = lu_prop1, Estimates = lu_prop2, 
Tool = c(rep("DeMixT(Liver Unknown)", 30), rep("DeMixT(Lung Unknown)", 30),rep("DeMixT(Brain Unknown)", 30),
rep("ISOpure(Liver Unknown)", 30),rep("ISOpure(Lung Unknown)", 30), rep("ISOpure(Brain Unknown)", 30)),
Tissue = c(rep("Liver Unknown", 30), rep("Lung Unknown", 30),rep("Brain Unknown", 30), 
rep("Liver Unknown", 30),rep("Lung Unknown", 30), rep("Brain Unknown", 30)),
Dec = c(rep("DeMixT", 90), rep("ISOpure", 90)))
########################################making plot########################################
plot( jitter(lu_prop$Truth*100,1), lu_prop$Estimates*100, main = "Proportions for lung",
col = ifelse(lu_prop$Dec == "ISOpure", "black", "blue"),
pch = cases((lu_prop$Tissue == "Liver Unknown") -> 3, (lu_prop$Tissue == "Lung Unknown") -> 1,
(lu_prop$Tissue == "Brain Unknown") -> 2),
xlim = c(0,100), ylim=c(0,100), ylab = "Estimates (%)", xlab = "Truth (%)")
legend("topleft", c("ISOpure", "DeMixT"),text.col=c("black", "blue"), merge=FALSE, cex = 1.5,bty = "n")
legend("bottomright", c("liver unknown","lung unknown","brain unknown"), 
pch=c(3,1,2), merge=FALSE, cex = 1.5,bty = "n")
abline(0,1,col= "red",lwd = 3)
box(lwd=5)