Overall Survival Curves for TCGA and Tothill by RD Status 
========================================================

by Susan L. Tucker

```{r options, echo=TRUE}
knitr::opts_chunk$set(tidy=TRUE, message=TRUE)
```

## 1 Executive Summary

### 1.1 Introduction

The goal of this analysis is to produce Kaplan-Meier curves of overall survival (OS) by residual disease (RD) status for patients included in TCGA and Tothill et al.

### 1.2 Data \& Methods

We use the RData objects containing clinical information created in previous reports (assembleTCGAClinical, assembleTothillClinical). Patients are filtered as described previously (filterTCGASamples, filterTothillSamples). Patients with missing survival information are also excluded.

Survival times are converted from months to years for the data of Tothill et al.

Kaplan-Meier plots are produced to illustrate OS in patient cohorts. OS is compared between groups using the log-rank test.

Comparisons considered are:

i) TCGA versus Tothill et al.

ii) Within each dataset by the RD categories provided in the original data sources.

iii) Within each dataset, any RD compared to no RD.

iv) Within each dataset, by FABP4 expression.

### 1.3 Results

Three patients are excluded from the filtered cohort of Tothill et al. because of missing survival information.

OS is essentially identical in TCGA and Tothill et al. (log-rank P = 0.975).

Within each data set, OS differs significantly by RD status, whether using the RD categories provided or comparing any RD to no RD.

In each data set, OS is worse among the 25% of patients with the highest FABP4 expression. The difference reaches statistical significance only in Tothill et al. (log-rank P = 0.0025, versus P = 0.139 in TCGA).

## 2 Loading \& Filtration of Data

The data objects are loaded. 

```{r loadData}

load(file.path("RDataObjects","tcgaClinical.RData"))
load(file.path("RDataObjects","tcgaFilteredSamples.RData"))
load(file.path("RDataObjects","tcgaExpression.RData"))

load(file.path("RDataObjects","tothillClinical.RData"))
load(file.path("RDataObjects","tothillFilteredSamples.RData"))
load(file.path("RDataObjects","tothillExpression.RData"))

```

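The sample filters are applied to the TCGA data, and the long sample barcodes are truncated to 12-character patient identifiers so that they match the clinical and survival tables.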

```{r tcgaFilter}

rownames(tcgaFilteredSamples)[1:2]
rownames(tcgaClinical)[1:2]
rownames(tcgaOSYrs)[1:2]
colnames(tcgaExpression[,1:2])

tcgaSampleUseLong <- rownames(tcgaFilteredSamples[which(tcgaFilteredSamples[,"sampleUse"]=="Used"),])
tcgaSampleUse <- substr(tcgaSampleUseLong,1,12)
length(tcgaSampleUse)
length(unique(tcgaSampleUse))

tcgaOSYrsUse <- tcgaOSYrs[tcgaSampleUse,]
summary(tcgaOSYrsUse)
tcgaClinUse <- tcgaClinical[tcgaSampleUse,]
tcgaRDUse <- tcgaRD[tcgaSampleUse]
table(tcgaRDUse)

tcgaExpressionUse <- tcgaExpression[,tcgaSampleUseLong]
colnames(tcgaExpressionUse) <- tcgaSampleUse

```
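The `substr(..., 1, 12)` step above relies on TCGA barcodes carrying the patient-level identifier in their first 12 characters. The following is a minimal sketch of that truncation; the barcodes are invented for illustration and `dupIds` is a hypothetical helper name. It also shows why the `length(unique(...))` check matters: two samples from one patient collapse to the same short ID.

```r
# Hypothetical sample-level barcodes; the real ones come from
# rownames(tcgaFilteredSamples).
barcodesLong <- c("TCGA-AA-0001-01A-01R-0001-01",
                  "TCGA-AA-0002-01A-01R-0001-01",
                  "TCGA-AA-0002-01B-01R-0001-01")

# The first 12 characters identify the patient.
patientIds <- substr(barcodesLong, 1, 12)

# Truncation can map several samples onto one patient, so list any
# duplicated short IDs before using them as row or column names.
dupIds <- unique(patientIds[duplicated(patientIds)])
dupIds
```

If `dupIds` is non-empty, indexing by the short IDs would be ambiguous and the duplicates would need to be resolved first.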

Filtrations are applied to the data of Tothill et al. and survival times are converted from months to years.

```{r tothillFilter}

rownames(tothillFilteredSamples)[1:2]
rownames(tothillClinical)[1:2]
rownames(tothillOSMos)[1:2]
colnames(tothillExpression[,1:2])

tothillSampleUseTmp <- rownames(tothillFilteredSamples[which(tothillFilteredSamples[,"sampleUse"]=="Used"),])
length(tothillSampleUseTmp)
summary(tothillOSMos[tothillSampleUseTmp,])

tothillSampleUse <- intersect(tothillSampleUseTmp,rownames(tothillOSMos[!is.na(tothillOSMos[,1]),]))
length(tothillSampleUse)

tothillOSYrsUse <- tothillOSMos[tothillSampleUse,]
tothillOSYrsUse[,1] <- tothillOSYrsUse[,1]/12
tothillClinUse <- tothillClinical[tothillSampleUse,]
tothillRDUse <- tothillRD[tothillSampleUse]
table(tothillRDUse)

tothillExpressionUse <- tothillExpression[,tothillSampleUse]

```
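The exclusion of patients without survival times and the month-to-year conversion can be illustrated on a toy matrix. This is only a sketch: the sample names, values, and column names (`time`, `status`) are assumptions and do not come from the Tothill data.

```r
# Toy survival matrix: column 1 = follow-up in months, column 2 = event flag.
osMos <- matrix(c(24, NA, 60, 6,
                  1,  0,  0,  1),
                ncol = 2,
                dimnames = list(c("S1", "S2", "S3", "S4"),
                                c("time", "status")))

# Samples that passed the earlier filters (invented).
candidates <- c("S1", "S2", "S3")

# Keep only candidates whose follow-up time is recorded.
hasTime <- rownames(osMos)[!is.na(osMos[, "time"])]
keep <- intersect(candidates, hasTime)

# drop = FALSE keeps the matrix shape even if a single sample remains.
osYrs <- osMos[keep, , drop = FALSE]
osYrs[, "time"] <- osYrs[, "time"] / 12
osYrs
```

Here `S2` is dropped for its missing time and `S4` because it was never a candidate, leaving `S1` and `S3` with times of 2 and 5 years.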

## 3 Analyses

Overall survival is compared in TCGA versus Tothill et al.

```{r compareOS}

tmp <- rbind(tcgaOSYrsUse, tothillOSYrsUse)

library(survival)
osAll <- Surv(tmp[, 1], tmp[, 2] == 1)

cohort <- rep(2, dim(osAll)[1])
cohort[1:dim(tcgaOSYrsUse)[1]] <- 1
table(cohort)

fit <- survfit(osAll ~ cohort)
survdiff(osAll ~ cohort)

plot(fit, lty = c(1, 2), xlab = "Years", ylab = "Overall Survival", lwd = 2, 
    main = "Overall Survival in TCGA versus Tothill")
legend(x = 8, y = 0.95, legend = c("TCGA (N=491)", "Tothill (N=186)"), lty = c(1, 2), lwd = 2)
text(11, 0.7, "P = 0.975")

```
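The group sizes and P value in the plot annotations above are typed in by hand. As a hedged alternative, they can be derived from the fitted objects so the labels stay in step with the data. The sketch below uses simulated data, not the TCGA or Tothill cohorts, and computes the log-rank P value from the chi-square statistic returned by `survdiff`.

```r
library(survival)

# Simulated data: two cohorts with exponential survival times (invented).
set.seed(1)
n1 <- 60
n2 <- 40
time <- c(rexp(n1, rate = 0.20), rexp(n2, rate = 0.25))
event <- rbinom(n1 + n2, size = 1, prob = 0.8)  # 1 = death, 0 = censored
cohort <- factor(rep(c("A", "B"), times = c(n1, n2)))

os <- Surv(time, event == 1)
lr <- survdiff(os ~ cohort)

# Log-rank P value: chi-square statistic on (number of groups - 1) df.
pval <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)

# Build the legend labels and the P-value annotation from the data.
counts <- table(cohort)
legendLabels <- paste0(names(counts), " (N=", counts, ")")
pLabel <- paste("P =", signif(pval, 3))
legendLabels
pLabel
```

The resulting strings could be passed to `legend()` and `text()` in place of the literal values.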

Overall survival by residual disease status is plotted for the TCGA data.

```{r kmTCGA}

table(tcgaClinUse$tumor_residual_disease)
tcgaGp <- rep(1,dim(tcgaOSYrsUse)[1])
tcgaGp[which(tcgaClinUse[,"tumor_residual_disease"] == "1-10 mm")] <- 2
tcgaGp[which(tcgaClinUse[,"tumor_residual_disease"] == "11-20 mm")] <- 3
tcgaGp[which(tcgaClinUse[,"tumor_residual_disease"] == ">20 mm")] <- 4

survTCGA <- Surv(tcgaOSYrsUse[,1], tcgaOSYrsUse[,2] == 1)
tcgaSurvFit <- survfit(survTCGA ~ tcgaGp)
survdiff(survTCGA ~ tcgaGp)

plot(tcgaSurvFit, lty=1:4, xlab = "Years after Surgery", ylab = "Proportion Surviving", lwd=2)
legend(x = 5, y = 0.98, legend = c("No macroscopic disease (N=113)", "1-10 mm (N=242)", "11-20 mm (N=34)", ">20 mm (N=102)"), lty = c(1:4), lwd=2, cex=0.8)
text(0.05,0.05,"(A) TCGA",pos=4)
text(7,0.6,"P = 0.0002",pos=4)

```
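The grouping above starts every patient in group 1 and then reassigns the three measurable-disease categories, so a missing or unexpected label would remain in group 1 unnoticed. A possible safeguard, sketched here on invented labels, is to code the groups as a factor with explicit levels. The spelling "No Macroscopic disease" is an assumption and should be checked against the output of `table(tcgaClinUse$tumor_residual_disease)`.

```r
# Invented residual-disease labels, including a missing value and a stray label.
rd <- c("No Macroscopic disease", "1-10 mm", ">20 mm", NA, "11-20 mm", "unknown")

# Assumed level spellings, ordered from least to most residual disease.
rdLevels <- c("No Macroscopic disease", "1-10 mm", "11-20 mm", ">20 mm")
rdGp <- factor(rd, levels = rdLevels)

# Values outside rdLevels become NA instead of silently joining the first group.
table(rdGp, useNA = "ifany")
nUnmatched <- sum(is.na(rdGp))
nUnmatched
```

A non-zero `nUnmatched` signals labels that need attention before the survival comparison is run.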

Overall survival by residual disease status is plotted for the data of Tothill et al.

```{r kmTothill}

table(tothillClinUse$ResidDisease)
tothillGp <- rep(1,dim(tothillOSYrsUse)[1])
tothillGp[which(tothillClinUse[,"ResidDisease"] == "<1")] <- 2
tothillGp[which(tothillClinUse[,"ResidDisease"] == ">1")] <- 3
tothillGp[which(tothillClinUse[,"ResidDisease"] == "macro size NK")] <- 4

survTothill <- Surv(tothillOSYrsUse[,1], tothillOSYrsUse[,2] == 1)
tothillSurvFit <- survfit(survTothill ~ tothillGp)
survdiff(survTothill ~ tothillGp)

plot(tothillSurvFit, lty=1:4, xlab = "Years after Surgery", ylab = "Proportion Surviving", lwd=2)
legend(x = 7, y = 0.98, legend = c("nil (N=50)", "<1 (N=66)", ">1 (N=57)", "macro size NK (N=13)"), lty = c(1:4), lwd=2, cex=0.8)
text(0.05,0.05,"(B) Tothill et al.",pos=4)
text(8,0.6,"P = 0.0217",pos=4)

```

For each data set, patients with any RD are compared to patients without RD. We do this first for the TCGA data.

```{r kmTCGArdVSnoRD}

table(tcgaRDUse)
tcgaSurvFit <- survfit(survTCGA ~ tcgaRDUse)
survdiff(survTCGA ~ tcgaRDUse)

plot(tcgaSurvFit, lty=1:4, xlab = "Years after Surgery", ylab = "Proportion Surviving", lwd=2)
legend(x = 6, y = 0.98, legend = c("No RD (N=113)", "Any RD (N=378)"), lty = c(1:4), lwd=2, cex=0.8)
text(0.05,0.05,"(A) TCGA",pos=4)
text(7,0.6,"P < 0.0001",pos=4)

```
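The binary RD indicator used here (`tcgaRDUse`) was built in an earlier report. For readers without that report, the sketch below shows one way such an indicator could be derived from the category labels. It is an illustration on invented values, not the construction actually used, and the "No Macroscopic disease" spelling is an assumption.

```r
# Invented category labels, including one missing value.
rdCat <- c("No Macroscopic disease", "1-10 mm", ">20 mm", NA, "11-20 mm")

# Any measurable residual disease counts as "RD".
# A missing category yields NA, so it is not assigned to either group.
rdBinary <- ifelse(rdCat == "No Macroscopic disease", "No RD", "RD")
rdBinary <- factor(rdBinary, levels = c("No RD", "RD"))
table(rdBinary, useNA = "ifany")
```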

We next do this for the data of Tothill et al.

```{r kmTothilRdVSnoRD}

table(tothillRDUse)
tothillSurvFit <- survfit(survTothill ~ tothillRDUse)
survdiff(survTothill ~ tothillRDUse)

plot(tothillSurvFit, lty=1:4, xlab = "Years", ylab = "Proportion Surviving", lwd=2)
legend(x = 6, y = 0.98, legend = c("No RD (N=50)", "Any RD (N=136)"), lty = c(1:4), lwd=2, cex=0.8)
text(0.05,0.05,"(B) Tothill et al.",pos=4)
text(7,0.6,"P = 0.0022",pos=4)

```

We produce the TCGA plot for the manuscript.

```{r kmTCGArdVSnoRDms}

table(tcgaRDUse)
tcgaSurvFit <- survfit(survTCGA ~ tcgaRDUse)
survdiff(survTCGA ~ tcgaRDUse)

plot(tcgaSurvFit, lty=1:4, xlab = "Years after Surgery", ylab = "Proportion Surviving", lwd=2)
legend(x = 6, y = 0.98, legend = c("No RD (N=113)", "Any RD (N=378)"), lty = c(1:4), lwd=2, cex=0.8)
text(0.05,0.05,"(A) TCGA",pos=4)
text(7,0.6,"P < 0.001",pos=4)

```

We do the same thing for Tothill et al.

```{r kmTothilRdVSnoRDms}

table(tothillRDUse)
tothillSurvFit <- survfit(survTothill ~ tothillRDUse)
survdiff(survTothill ~ tothillRDUse)

plot(tothillSurvFit, lty=1:4, xlab = "Years", ylab = "Proportion Surviving", lwd=2)
legend(x = 6, y = 0.98, legend = c("No RD (N=50)", "Any RD (N=136)"), lty = c(1:4), lwd=2, cex=0.8)
text(0.05,0.05,"(B) Tothill et al.",pos=4)
text(7,0.6,"P = 0.002",pos=4)

```

We also compare OS in each data set between patients with FABP4 expression in the top 25% and those in the lower 75%. We begin with TCGA.

```{r kmTCGAfabp4}

probeNames <- rownames(tcgaExpressionUse)

library(hthgu133a.db)
geneNames <- unlist(mget(probeNames, hthgu133aSYMBOL))
probesFABP4 <- probeNames[which(geneNames == "FABP4")]
probesFABP4

tcgaFABP4 <- tcgaExpressionUse[probesFABP4,]

tcgaFABP4Gp <- rep(0,length(tcgaFABP4))
tcgaFABP4Gp[tcgaFABP4 > quantile(tcgaFABP4, probs = c(.75))] <- 1
table(tcgaFABP4Gp)

tcgaSurvFit <- survfit(survTCGA ~ tcgaFABP4Gp)
survdiff(survTCGA ~ tcgaFABP4Gp)

plot(tcgaSurvFit, lty=1:4, xlab = "Years", ylab = "Proportion Surviving", lwd=2)
legend(x = 6, y = 0.98, legend = c("Low FABP4 (N=368)", "High FABP4 (N=123)"), lty = c(1,2), lwd=2, cex=0.8)
text(0.05,0.05,"(A) TCGA",pos=4)
text(7,0.6,"P = 0.139",pos=4)

```
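The top-quartile split can be checked on a small vector. This sketch uses invented expression values and assumes a single probe, so the expression data form a plain numeric vector. It uses the same strict `>` comparison against the 75th percentile as the chunk above.

```r
# Toy expression values for a single probe (invented).
expr <- c(5.1, 7.3, 6.2, 9.8, 4.4, 8.0, 6.9, 10.5)

# 75th percentile using R's default quantile definition (type 7).
cutoff <- quantile(expr, probs = 0.75)

# 1 = strictly above the cutoff (top quartile), 0 = lower 75%.
highGp <- as.integer(expr > cutoff)
table(highGp)
```

With these eight values the cutoff is 8.45, so only 9.8 and 10.5 fall in the high group. If several probes mapped to the gene, the expression subset would be a matrix rather than a vector and the probes would have to be selected or summarised first.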

We repeat, using the data of Tothill et al.

```{r kmTothillFabp4}

tothillFABP4 <- tothillExpressionUse[probesFABP4,]

tothillFABP4Gp <- rep(0,length(tothillFABP4))
tothillFABP4Gp[tothillFABP4 > quantile(tothillFABP4, probs = c(.75))] <- 1
table(tothillFABP4Gp)

tothillSurvFit <- survfit(survTothill ~ tothillFABP4Gp)
survdiff(survTothill ~ tothillFABP4Gp)

plot(tothillSurvFit, lty=1:2, xlab = "Years", ylab = "Proportion Surviving", lwd=2)
legend(x = 6, y = 0.98, legend = c("Low FABP4 (N=139)", "High FABP4 (N=47)"), lty = c(1:2), lwd=2, cex=0.8)
text(0.05,0.05,"(B) Tothill et al.",pos=4)
text(7,0.6,"P = 0.0025",pos=4)

```

## 4 Appendix

### 4.1 File Location

```{r getLocation}
getwd()
```

### 4.2 SessionInfo

```{r sessionInfo}
sessionInfo()
```