Survival Analysis (DB 10)

Introduction
Survivor functions and hazard functions
Kaplan-Meier estimate
Estimation
- Exponential model
- Weibull model
Model checking

Display system information and load tidyverse.

sessionInfo()

## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.2  magrittr_1.5    tools_3.6.2     htmltools_0.4.0
##  [5] yaml_2.2.1      Rcpp_1.0.4.6    stringi_1.4.6   rmarkdown_2.1  
##  [9] knitr_1.28      stringr_1.4.0   xfun_0.13       digest_0.6.25  
## [13] rlang_0.4.5     evaluate_0.14

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

This lecture note draws heavily on the Chapter 10 of textbook An Introduction to Generalized Linear Models by Dobson and Barnett (2nd or 3rd edition). ULCA library link to the 2nd edition.

Introduction

Survival model studies the time from a well-defined starting point until some event, called “failure”, occurs. In other words, survial models studies time-to-event.
Survival outcomes, or survival times, have at least two important features:
1. the times are non-negative and typically have skewed distributions with long tails.
2. censoring: some subjects may survive beyond the study period so their actual failure times are unknown.
In following diagram, subject 3 is right-censored, subject 4 is right-censored due to loss to follow up, and subject 5 is left-censored.

Survivor functions and hazard functions

Let $Y=$ survival time, and $f(y)$ be its density. The probability of failure before a specific time $y$ is given by the cummulative distribution function (cdf) \[ F(y) = \mathbb{P}(Y \le y) = \int_0^y f(t) \, dt. \]
The survivor function is the probability of survival beyond time $y$ \[ S(y) = \mathbb{P}(Y > y) = 1 - F(y). \]
The hazard function is the probability of death in an infinitesimal small interval $[y, y + \Delta y]$, given surivival up to time $y$ \[\begin{eqnarray*} h(y) &=& \lim_{\Delta y \to 0} \frac{\mathbb{P}(y < Y \le y + \Delta y \mid Y > y)}{\Delta y} \\ &=& \lim_{\Delta y \to 0} \frac{F(y + \Delta y) - F(y)}{\Delta y} \times \frac{1}{S(y)} \\ &=& \frac{f(y)}{S(y)}. \end{eqnarray*}\]
Rewrite the hazard function as \[ h(y) = - \frac{\operatorname{d}}{\operatorname{d} y} \log S(y). \] Then we have \[ S(y) = e^{-H(y)} = e^{- \int_0^y h(t) \, dt} \] or \[ H(y) = - \log S(y). \] $H(y)$ is called the cumulative hazard function or the integrated hazard function.
The median survival time, $y(50)$, is given by the solution of $F(y) = 1/2$.

Exponential distribution

A simple model for $Y$ is the exponential distribution with density \[ f(y; \theta) = \theta e^{-\theta y}, \quad y \ge 0, \theta > 0 \] and moments $\mathbb{E}Y = \theta^{-1}$, $\operatorname{Var} Y = \theta^{-2}$.

y <- seq(0, 2.5, by = 0.01)
theta <- 2
exp_df <- tibble(y = y,
                 f = dexp(y, theta),
                 F = pexp(y, theta),
                 S = 1 - F,
                 h = theta,
                 H = theta * y)
exp_df %>%  
  ggplot() + 
  geom_line(mapping = aes(x = y, y = f)) + 
  labs(x = "y", y = "f(y)", title = "Exponential density")

The cdf of exponential is \[ F(y; \theta) = \int_0^y \theta e^{-\theta t} \, dt = 1 - e^{-\theta y}. \]
The survivor function is \[ S(y,; \theta) = e^{-\theta y}. \]

exp_df %>%
  ggplot() + 
  geom_line(mapping = aes(x = y, y = S)) + 
  labs(x = "y", y = "S(y)", title = "Survivor function of exponential")

The hazard function is \[ h(y; \theta) = \frac{f(y; \theta)}{S(y; \theta)} = \theta. \] The hazard function does not depent on $y$. This lack of memory property may be a limitation because in many applications the probability of failure often increases with time.

exp_df %>%
  ggplot() + 
  geom_line(mapping = aes(x = y, y = h)) + 
  labs(x = "y", y = "h(y)", title = "Hazard function of exponential")

The cumulative hazard function is \[ H(y; \theta) = \theta y. \]

exp_df %>%
  ggplot() + 
  geom_line(mapping = aes(x = y, y = H)) + 
  labs(x = "y", y = "H(y)", title = "Cumulative hazard function of exponential")

Proportional hazards model

For the exponential distribution, to ensure $\theta>0$, it is common to use the log link \[ \theta = e^{\mathbf{x}^T \boldsymbol{\beta}}. \] Then the hazard function has the multiplicative form \[ h(y; \boldsymbol{\beta}) = \theta = e^{\mathbf{x}^T \boldsymbol{\beta}}. \] A unit change in $x_k$ will result in a hazard ratio or relative hazard \[ \frac{h_1(y; \boldsymbol{\beta})}{h_0(y; \boldsymbol{\beta})} = e^{\beta_k}. \]
Any model of the form \[ h_1(y) = h_0(y) e^{\mathbf{x}^T \boldsymbol{\beta}} \] is called a proportional hazards model. $h_0(y)$, which is hazard function corresponding to the reference levels for all the explanatory variables, is called the baseline hazard.
For proportional hazards models, the cumulative hazard function is given by \[ H_1(y) = \int_0^y h_1(t) \, dt = \int_0^y h_0(t) e^{\mathbf{x}^T \boldsymbol{\beta}} \, dt = H_0(y) e^{\mathbf{x}^T \boldsymbol{\beta}}, \] so \[ \log H_1(y) = \log H_0(y) + \mathbf{x}^T \boldsymbol{\beta}. \]

Weibull distribution

Another commonly used model for survival times is the Weibull distribution with density \[ f(y; \lambda, \theta) = \frac{\lambda y^{\lambda - 1}}{\theta^\lambda} e^{-(y / \theta)^\lambda}, \quad y \ge 0, \lambda > 0, \theta > 0, \] where $\lambda$ and $\theta$ are shape and scale parameters respectively.

y <- seq(0, 2.5, by = 0.01)
lambda <- 5 # shape
theta <- 1 # scale
phi = exp(-lambda * log(theta))
wb_df <- tibble(y = y,
                 f = dweibull(y, lambda, theta),
                 F = pweibull(y, lambda, theta),
                 S = 1 - F,
                 h = lambda * phi * exp((lambda - 1 ) * log(y)),
                 H = phi * exp(lambda * log(y)))
wb_df %>%
  ggplot() + 
  geom_line(mapping = aes(x = y, y = f)) + 
  labs(x = "y", y = "f(y)", title = "Weibull density")

With reparameterization $\theta^{-\lambda} = \phi$, the density is \[ f(y; \lambda, \phi) = \lambda \phi y^{\lambda - 1} e^{-\phi y^{\lambda}}. \] Exponential distribution is a special case of Weibull with $\lambda = 1$.
The survivor function for Weibull is \[ S(y; \lambda, \phi) = \int_y^\infty \lambda \phi u^{\lambda - 1} e^{-\phi u^{\lambda}} \, du = e^{-\phi y^{\lambda}}. \]

wb_df %>%
  ggplot() + 
  geom_line(mapping = aes(x = y, y = S)) + 
  labs(x = "y", y = "S(y)", title = "Survivor functiion of Weibull")

The hazard function is \[ h(y; \lambda, \phi) = \frac{f(y; \lambda, \phi)}{S(y; \lambda, \phi)} = \lambda \phi y^{\lambda - 1}. \]

For $\lambda > 1$, the hazard function depends on $y$ and yields increasing Weibull or accelerated failure time models. Such model might be expected for leukemia patients not responding to treatment, where the event of interest is death. As survival time increases for such a patient, and as the prognosis accordingly worsens, the patient’s potential for dying of the disease also increases.

For $\lambda < 1$, the hazard function depends on $y$ and yields decreasing Weibull models. Such model might be expected when the event is death in persons who are recovering from surgery, because the potential for dying after surgery usually decreases as the time after surgery increases.

wb_df %>%
  ggplot() + 
  geom_line(mapping = aes(x = y, y = h)) + 
  labs(x = "y", y = "h(y)", title = "Hazard functiion of Weibull")

The cumulative hazard function is \[ H(y; \lambda, \phi) = \phi y^{\lambda}. \]

wb_df %>%
  ggplot() + 
  geom_line(mapping = aes(x = y, y = H)) + 
  labs(x = "y", y = "f(y)", title = "Cummulative hazard functiion of Weibull")

The mean of Weibull is \[ \mathbb{E} Y = \int_0^\infty \lambda \phi y^{\lambda} e^{-\phi y^\lambda} \, dy = \phi^{-1/\lambda} \Gamma(1 + 1/\lambda). \] The median of Weibull is \[ y(50) = \phi^{-1/\lambda} (\log 2)^{1/\lambda}. \]
We model relationship between $Y$ and predictors in terms of $\phi$ \[ \phi = \alpha e^{\mathbf{x}^T \boldsymbol{\beta}}. \] Then the hazard function is \[ h(y; \lambda, \phi) = \lambda \alpha y^{\lambda - 1} e^{\mathbf{x}^T \boldsymbol{\beta}}. \] If $h_0(y)$ is the baseline hazard function corresponding to reference levels of all predictors, then \[ h(y) = h_0(y) e^{\mathbf{x}^T \boldsymbol{\beta}}, \] which is a proportional hazards model.

Kaplan-Meier estimate

The empirical survivor function, an estimate of the probability of survival beyond time $y$, is \[ \widehat{S}(y) = \widehat{\mathbb{P}}(Y > y). \]
Arrange the even times $y_{(1)} \le y_{(2)} \le \cdots \le y_{(k)}$. Kaplan-Meier formula or product limit formula \[\begin{eqnarray*} \widehat{S}(y_{k}) &=& \widehat{\mathbb{P}}(Y > y_{(k)}) \\ &=& \widehat{\mathbb{P}}(Y > y_{(k-1)}) \times \widehat{\mathbb{P}}(Y > y_{(k)} \mid Y > y_{(k-1)}) \\ &=& \widehat{S}(y_{k-1}) \times \widehat{\mathbb{P}}(Y > y_{(k)} \mid Y > y_{(k-1)}) \\ &=& \cdots \\ &=& \prod_{j=1}^k \widehat{\mathbb{P}}(Y > y_{(j)} \mid Y > y_{(j-1)}) \\ &=& \prod_{j=1}^k \left( \frac{n_j - d_j}{n_j} \right), \end{eqnarray*}\] where \[\begin{eqnarray*} n_j &=& \text{number of subjects who are alive just before time $y_{(j)}$}, \\ d_j &=& \text{number of deaths that occur at time $y_{(j)}$}. \end{eqnarray*}\]
The leukemia data gehan. time is the remission time in weeks.

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(ggfortify)

gehan <- as_tibble(gehan) %>%
  mutate(treat = relevel(treat, ref = "control")) %>%
  print(n = Inf)

## # A tibble: 42 x 4
##     pair  time  cens treat  
##    <int> <int> <int> <fct>  
##  1     1     1     1 control
##  2     1    10     1 6-MP   
##  3     2    22     1 control
##  4     2     7     1 6-MP   
##  5     3     3     1 control
##  6     3    32     0 6-MP   
##  7     4    12     1 control
##  8     4    23     1 6-MP   
##  9     5     8     1 control
## 10     5    22     1 6-MP   
## 11     6    17     1 control
## 12     6     6     1 6-MP   
## 13     7     2     1 control
## 14     7    16     1 6-MP   
## 15     8    11     1 control
## 16     8    34     0 6-MP   
## 17     9     8     1 control
## 18     9    32     0 6-MP   
## 19    10    12     1 control
## 20    10    25     0 6-MP   
## 21    11     2     1 control
## 22    11    11     0 6-MP   
## 23    12     5     1 control
## 24    12    20     0 6-MP   
## 25    13     4     1 control
## 26    13    19     0 6-MP   
## 27    14    15     1 control
## 28    14     6     1 6-MP   
## 29    15     8     1 control
## 30    15    17     0 6-MP   
## 31    16    23     1 control
## 32    16    35     0 6-MP   
## 33    17     5     1 control
## 34    17     6     1 6-MP   
## 35    18    11     1 control
## 36    18    13     1 6-MP   
## 37    19     4     1 control
## 38    19     9     0 6-MP   
## 39    20     1     1 control
## 40    20     6     0 6-MP   
## 41    21     8     1 control
## 42    21    10     0 6-MP

Numerical summary:

summary(gehan)

##       pair         time            cens            treat   
##  Min.   : 1   Min.   : 1.00   Min.   :0.0000   control:21  
##  1st Qu.: 6   1st Qu.: 6.00   1st Qu.:0.0000   6-MP   :21  
##  Median :11   Median :10.50   Median :1.0000               
##  Mean   :11   Mean   :12.88   Mean   :0.7143               
##  3rd Qu.:16   3rd Qu.:18.50   3rd Qu.:1.0000               
##  Max.   :21   Max.   :35.00   Max.   :1.0000

Graphical summary:

gehan %>%
  #filter(treat == "control") %>%
  ggplot() + 
  geom_dotplot(mapping = aes(x = time, fill = factor(cens)), binwidth = 1) + 
  facet_wrap(~ treat)

library(survival)

kmfit <- survfit(Surv(time, cens) ~ treat, data = gehan)
summary(kmfit)

## Call: survfit(formula = Surv(time, cens) ~ treat, data = gehan)
## 
##                 treat=control 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1     21       2   0.9048  0.0641      0.78754        1.000
##     2     19       2   0.8095  0.0857      0.65785        0.996
##     3     17       1   0.7619  0.0929      0.59988        0.968
##     4     16       2   0.6667  0.1029      0.49268        0.902
##     5     14       2   0.5714  0.1080      0.39455        0.828
##     8     12       4   0.3810  0.1060      0.22085        0.657
##    11      8       2   0.2857  0.0986      0.14529        0.562
##    12      6       2   0.1905  0.0857      0.07887        0.460
##    15      4       1   0.1429  0.0764      0.05011        0.407
##    17      3       1   0.0952  0.0641      0.02549        0.356
##    22      2       1   0.0476  0.0465      0.00703        0.322
##    23      1       1   0.0000     NaN           NA           NA
## 
##                 treat=6-MP 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     6     21       3    0.857  0.0764        0.720        1.000
##     7     17       1    0.807  0.0869        0.653        0.996
##    10     15       1    0.753  0.0963        0.586        0.968
##    13     12       1    0.690  0.1068        0.510        0.935
##    16     11       1    0.627  0.1141        0.439        0.896
##    22      7       1    0.538  0.1282        0.337        0.858
##    23      6       1    0.448  0.1346        0.249        0.807

autoplot(kmfit)

Exercise: Manually calculate the Kaplan-Meier estimator for the treatment and control groups.

Estimation

A datum of a survival data set is $(y_j, \delta_j, \mathbf{x}_j)$ where
- $y_j$ is the survival time,
- $\delta_j$ is the censoring indicator with $\delta_j = 1$ if the survival time is uncensored and $\delta_j=0$ if it is censored, and
- $\mathbf{x}_j$ is the predictor vector.
Let $y_1, \ldots, y_r$ be the uncensored observations and $y_{r+1}, \ldots, y_n$ the censored ones. The likelihood is \[ L = \prod_{j=1}^r f(y_j) \times \prod_{j=r+1}^n S(y_j) = \prod_{j=1}^n f(y_j)^{\delta_j} S(y_j)^{1 - \delta_j} \] so the log-likelihood is \[ \ell = \sum_{i=1}^n [\delta_j \log f(y_j) + (1 - \delta_j) \log S(y_j)] = \sum_{j=1}^n [\delta_j \log h(y_j) + \log S(y_j)]. \]
Esimtation by MLE.

Exponential model

The log-likelihood for the exponential model is \[ \ell(\boldsymbol{\theta}) = \sum_j [\delta_j \log h(y_j) + \log S(y_j)] = \sum_j (\delta_j \log \theta_j - \theta_j y_j). \] The right hand side resembles the log-likelihood if treating $\delta_j$ as Poisson$(\mu_j)$ where $\mu_j = \theta_j y_j$. Only the survival times $y_j$ are included in the model as an offset term $\log y_j$.
Proportional hazards model is modeled with $\theta = e^{\mathbf{x}^T \boldsymbol{\beta}}$. We have \[ \ell(\boldsymbol{\beta}) = \sum_j (\delta_j \cdot \mathbf{x}_j^T \boldsymbol{\beta} - y_j e^{\mathbf{x}_j^T \boldsymbol{\beta}}) = \sum_j \left\{ \delta_j \cdot [\mathbf{x}_j^T \boldsymbol{\beta} + \log(y_j)] - y_j e^{\mathbf{x}_j^T \boldsymbol{\beta}} - \delta_j \log(y_j) \right\}. \]

glm(cens ~ treat + offset(log(time)), family = poisson, data = gehan) %>%
  summary()

## 
## Call:
## glm(formula = cens ~ treat + offset(log(time)), family = poisson, 
##     data = gehan)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.32472  -0.75468   0.07899   0.87500   1.59679  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.1595     0.2182  -9.896  < 2e-16 ***
## treat6-MP    -1.5266     0.3984  -3.832 0.000127 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 54.503  on 41  degrees of freedom
## Residual deviance: 38.017  on 40  degrees of freedom
## AIC: 102.02
## 
## Number of Fisher Scoring iterations: 6

Check against the standard software in survival package. The coefficients have opposite sign because survreg links to the mean of exponential which is $\theta^{-1}$ according to the accelerated failure time (AFT) model.

emod <- survreg(Surv(time, cens) ~ treat, dist = "exponential", data = gehan)
summary(emod)

## 
## Call:
## survreg(formula = Surv(time, cens) ~ treat, data = gehan, dist = "exponential")
##             Value Std. Error    z       p
## (Intercept) 2.159      0.218 9.90 < 2e-16
## treat6-MP   1.527      0.398 3.83 0.00013
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= -108.5   Loglik(intercept only)= -116.8
##  Chisq= 16.49 on 1 degrees of freedom, p= 4.9e-05 
## Number of Newton-Raphson Iterations: 4 
## n= 42

Weibull model

With link function $\phi = \alpha e^{\mathbf{x}^T \boldsymbol{\beta}}$, the log-likelihood of Weibull model is \[\begin{eqnarray*} \ell(\boldsymbol{\beta}, \lambda) &=& \sum_j [\delta_j \log h(y_j) + \log S(y_j)] \\ &=& \sum_{j=1}^n [\delta_j \log (\lambda \alpha y_j^{\lambda - 1} e^{\mathbf{x}_j^T \boldsymbol{\beta}}) - \alpha y_j^{\lambda} e^{\mathbf{x}_j^T \boldsymbol{\beta}}] \\ &=& \sum_{j=1}^n [\delta_j [\log (\alpha y_j^{\lambda} e^{\mathbf{x}^T \boldsymbol{\beta}}) + \log \lambda - \log y_j] - \alpha y_j^{\lambda} e^{\mathbf{x}_j^T \boldsymbol{\beta}}] \\ &=& \sum_{j=1}^n \{\delta_j [\log \alpha + \lambda \log y_j + \mathbf{x}_j^T \boldsymbol{\beta}] - \alpha y_j^{\lambda} e^{\mathbf{x}_j^T \boldsymbol{\beta}} + \delta_j(\log \lambda - \log y_j)\}. \end{eqnarray*}\] Again this is almost the log-likelihood if we pretend $\delta_j$ are independent Poisson$(\mu_j)$ where $\mu_j = \alpha y_j^{\lambda} e^{\mathbf{x}^T \boldsymbol{\beta}}$ and include $\lambda \log y_j$ as offset.
Setting the derivative wrt $\lambda$ to 0 \[ \frac{\partial \ell}{\partial \lambda} = \sum_{j=1}^n \frac{\delta_j}{\lambda} + \delta_j \log y_j - (\log y_j) (\alpha y_j^{\lambda} e^{\mathbf{x}^T \boldsymbol{\beta}}) = 0 \] suggests an update \[ \widehat{\lambda} = \frac{\sum_j \delta_j}{\sum_{j=1}^n (\mu_j - \delta_j) \log y_j}. \] This suggests an estimation procedure that alternately updates $\lambda$ by above equation and then updates $\boldsymbol{\beta}$ by a Poisson regression with offset $\lambda \log (y_i)$. The damping update recommended by Aitkin and Clayton (1980) improves the convergence \[ \lambda^{(k)} = \frac{\lambda^{(k-1)} + \widehat{\lambda}}{2}. \]
Let’s manually implement the estimation procedure, starting from the exponential model with $\lambda = 1$.

sumdelta <- sum(gehan$cens)
lambda <- 1
for (iter in 1:10) {
  pmod <- glm(cens ~ treat + offset(lambda * log(time)), family = poisson, data = gehan)
  cat(iter, "lambda=", lambda, "\t", "beta(treatment)=", coef(pmod)[2], "\n")
  mu <- predict(pmod, type = "response")
  lambda_new <- sumdelta / sum((mu - gehan$cens) * log(gehan$time))
  lambda = (lambda + lambda_new) / 2
}

## 1 lambda= 1   beta(treatment)= -1.526614 
## 2 lambda= 1.315794    beta(treatment)= -1.703675 
## 3 lambda= 1.35468     beta(treatment)= -1.724859 
## 4 lambda= 1.363214    beta(treatment)= -1.729492 
## 5 lambda= 1.365169    beta(treatment)= -1.730553 
## 6 lambda= 1.365621    beta(treatment)= -1.730798 
## 7 lambda= 1.365726    beta(treatment)= -1.730855 
## 8 lambda= 1.36575     beta(treatment)= -1.730868 
## 9 lambda= 1.365756    beta(treatment)= -1.730871 
## 10 lambda= 1.365757   beta(treatment)= -1.730872

summary(pmod)

## 
## Call:
## glm(formula = cens ~ treat + offset(lambda * log(time)), family = poisson, 
##     data = gehan)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5148  -0.8408   0.2221   1.0481   2.0577  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.0707     0.2182 -14.072  < 2e-16 ***
## treat6-MP    -1.7309     0.3984  -4.344  1.4e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 74.045  on 41  degrees of freedom
## Residual deviance: 52.831  on 40  degrees of freedom
## AIC: 116.83
## 
## Number of Fisher Scoring iterations: 6

Note the inference (standard errors) is for the Poisson model. Correct inference can be obtained from the information matrix of $(\lambda, \boldsymbol{\beta})$. Exercise: interpret the coefficient $\beta_{\text{treatment}} = -1.73$.

The survival function in R fits a AFT model. The coefficients from an AFT model can be translated to PH model. Details are omitted here (learn that in a surival class.)

wmod <- survreg(Surv(time, cens) ~ treat, dist = "weibull", data = gehan)
summary(wmod)

## 
## Call:
## survreg(formula = Surv(time, cens) ~ treat, data = gehan, dist = "weibull")
##              Value Std. Error     z       p
## (Intercept)  2.248      0.166 13.55 < 2e-16
## treat6-MP    1.267      0.311  4.08 4.5e-05
## Log(scale)  -0.312      0.147 -2.12   0.034
## 
## Scale= 0.732 
## 
## Weibull distribution
## Loglik(model)= -106.6   Loglik(intercept only)= -116.4
##  Chisq= 19.65 on 1 degrees of freedom, p= 9.3e-06 
## Number of Newton-Raphson Iterations: 5 
## n= 42

Model checking

Check probability model

For exponential model, \[ S(y; \theta) = e^{-\theta y}. \] So plot of $- \log[\widehat{S}(y)]$ against y should be approximately linear.

tibble(surv = kmfit$surv,
       time = kmfit$time) %>%
  ggplot() + 
  geom_point(mapping = aes(x = time, y = - log(surv))) + 
  labs(x = "Event time", y = "-log(Surv. Prob.)")

For Weibull model, \[ S(y; \lambda, \phi) = e^{-\phi y^\lambda}. \] So we expect the plot of $\log [- \log \widehat{S}(y)]$ against $\log y$ to be linear.

tibble(surv = kmfit$surv,
       time = kmfit$time) %>%
  ggplot() + 
  geom_point(mapping = aes(x = log(time), y = log(- log(surv)))) + 
  labs(x = "log(Event Time)", y = "log[-log(Surv. Prob.)]")

If there is some curvi-linear relationship, some other probability models e.g. log-logistic may be better.

Check proportional hazard assumption

For proportional hazards model \[\begin{eqnarray*} h(y) = h_0(y) e^{\mathbf{x}^T \boldsymbol{\beta}}. \end{eqnarray*}\] Consider a binary predictor $x_k \in \{0,1\}$, then \[ \log H_1 = \log H_0 + \beta_k. \] So the $\log [- \log \widehat{S}(y)]$ against $\log y$ plot stratified according to $x_k$ should be parallel with distance $\beta_k$.

Residual plots

Cox-Snell residuals. For uncensored survival time of subject $j$, \[ r_{C_j} = \widehat{H}_j(Y_j) = - \log [\widehat{S}_j(y_j)]. \] For censored observations, $r'_{C_j} = r_{C_j} + \Delta$, where $\Delta = 1$ or $\log 2$. Distribution of $r'_{C_j}$ is compared to exponential with unit mean.
Martingale residuals \[ r_{M_j} = \delta_j - r_{C_j}. \] These residuals have expected value of 0 but a negatively skewed distribution.
Deviance residuals. \[ r_{D_j} = \text{sign}(r_{M_j}) \{ -2[r_{M_j} + \delta_j \log (r_{C_j})] \}^{1/2}. \] Deviance residuals are approximately symmetrically distributed about 0.