Display system information and load tidyverse.

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.2  magrittr_1.5    tools_3.6.2     htmltools_0.4.0
##  [5] yaml_2.2.1      Rcpp_1.0.4.6    stringi_1.4.6   rmarkdown_2.1  
##  [9] knitr_1.28      stringr_1.4.0   xfun_0.13       digest_0.6.25  
## [13] rlang_0.4.5     evaluate_0.14
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

This lecture note draws heavily on the Chapter 10 of textbook An Introduction to Generalized Linear Models by Dobson and Barnett (2nd or 3rd edition). ULCA library link to the 2nd edition.

Introduction

Survivor functions and hazard functions

Exponential distribution

  • A simple model for \(Y\) is the exponential distribution with density \[ f(y; \theta) = \theta e^{-\theta y}, \quad y \ge 0, \theta > 0 \] and moments \(\mathbb{E}Y = \theta^{-1}\), \(\operatorname{Var} Y = \theta^{-2}\).
y <- seq(0, 2.5, by = 0.01)
theta <- 2
exp_df <- tibble(y = y,
                 f = dexp(y, theta),
                 F = pexp(y, theta),
                 S = 1 - F,
                 h = theta,
                 H = theta * y)
exp_df %>%  
  ggplot() + 
  geom_line(mapping = aes(x = y, y = f)) + 
  labs(x = "y", y = "f(y)", title = "Exponential density")

  • The cdf of exponential is \[ F(y; \theta) = \int_0^y \theta e^{-\theta t} \, dt = 1 - e^{-\theta y}. \]

  • The survivor function is \[ S(y,; \theta) = e^{-\theta y}. \]

exp_df %>%
  ggplot() + 
  geom_line(mapping = aes(x = y, y = S)) + 
  labs(x = "y", y = "S(y)", title = "Survivor function of exponential")

  • The hazard function is \[ h(y; \theta) = \frac{f(y; \theta)}{S(y; \theta)} = \theta. \] The hazard function does not depent on \(y\). This lack of memory property may be a limitation because in many applications the probability of failure often increases with time.
exp_df %>%
  ggplot() + 
  geom_line(mapping = aes(x = y, y = h)) + 
  labs(x = "y", y = "h(y)", title = "Hazard function of exponential")

  • The cumulative hazard function is \[ H(y; \theta) = \theta y. \]
exp_df %>%
  ggplot() + 
  geom_line(mapping = aes(x = y, y = H)) + 
  labs(x = "y", y = "H(y)", title = "Cumulative hazard function of exponential")

Proportional hazards model

  • For the exponential distribution, to ensure \(\theta>0\), it is common to use the log link \[ \theta = e^{\mathbf{x}^T \boldsymbol{\beta}}. \] Then the hazard function has the multiplicative form \[ h(y; \boldsymbol{\beta}) = \theta = e^{\mathbf{x}^T \boldsymbol{\beta}}. \] A unit change in \(x_k\) will result in a hazard ratio or relative hazard \[ \frac{h_1(y; \boldsymbol{\beta})}{h_0(y; \boldsymbol{\beta})} = e^{\beta_k}. \]

  • Any model of the form \[ h_1(y) = h_0(y) e^{\mathbf{x}^T \boldsymbol{\beta}} \] is called a proportional hazards model. \(h_0(y)\), which is hazard function corresponding to the reference levels for all the explanatory variables, is called the baseline hazard.

  • For proportional hazards models, the cumulative hazard function is given by \[ H_1(y) = \int_0^y h_1(t) \, dt = \int_0^y h_0(t) e^{\mathbf{x}^T \boldsymbol{\beta}} \, dt = H_0(y) e^{\mathbf{x}^T \boldsymbol{\beta}}, \] so \[ \log H_1(y) = \log H_0(y) + \mathbf{x}^T \boldsymbol{\beta}. \]

Weibull distribution

  • Another commonly used model for survival times is the Weibull distribution with density \[ f(y; \lambda, \theta) = \frac{\lambda y^{\lambda - 1}}{\theta^\lambda} e^{-(y / \theta)^\lambda}, \quad y \ge 0, \lambda > 0, \theta > 0, \] where \(\lambda\) and \(\theta\) are shape and scale parameters respectively.
y <- seq(0, 2.5, by = 0.01)
lambda <- 5 # shape
theta <- 1 # scale
phi = exp(-lambda * log(theta))
wb_df <- tibble(y = y,
                 f = dweibull(y, lambda, theta),
                 F = pweibull(y, lambda, theta),
                 S = 1 - F,
                 h = lambda * phi * exp((lambda - 1 ) * log(y)),
                 H = phi * exp(lambda * log(y)))
wb_df %>%
  ggplot() + 
  geom_line(mapping = aes(x = y, y = f)) + 
  labs(x = "y", y = "f(y)", title = "Weibull density")

  • With reparameterization \(\theta^{-\lambda} = \phi\), the density is \[ f(y; \lambda, \phi) = \lambda \phi y^{\lambda - 1} e^{-\phi y^{\lambda}}. \] Exponential distribution is a special case of Weibull with \(\lambda = 1\).

  • The survivor function for Weibull is \[ S(y; \lambda, \phi) = \int_y^\infty \lambda \phi u^{\lambda - 1} e^{-\phi u^{\lambda}} \, du = e^{-\phi y^{\lambda}}. \]

wb_df %>%
  ggplot() + 
  geom_line(mapping = aes(x = y, y = S)) + 
  labs(x = "y", y = "S(y)", title = "Survivor functiion of Weibull")

  • The hazard function is \[ h(y; \lambda, \phi) = \frac{f(y; \lambda, \phi)}{S(y; \lambda, \phi)} = \lambda \phi y^{\lambda - 1}. \]

    For \(\lambda > 1\), the hazard function depends on \(y\) and yields increasing Weibull or accelerated failure time models. Such model might be expected for leukemia patients not responding to treatment, where the event of interest is death. As survival time increases for such a patient, and as the prognosis accordingly worsens, the patient’s potential for dying of the disease also increases.

    For \(\lambda < 1\), the hazard function depends on \(y\) and yields decreasing Weibull models. Such model might be expected when the event is death in persons who are recovering from surgery, because the potential for dying after surgery usually decreases as the time after surgery increases.

wb_df %>%
  ggplot() + 
  geom_line(mapping = aes(x = y, y = h)) + 
  labs(x = "y", y = "h(y)", title = "Hazard functiion of Weibull")

  • The cumulative hazard function is \[ H(y; \lambda, \phi) = \phi y^{\lambda}. \]
wb_df %>%
  ggplot() + 
  geom_line(mapping = aes(x = y, y = H)) + 
  labs(x = "y", y = "f(y)", title = "Cummulative hazard functiion of Weibull")

  • The mean of Weibull is \[ \mathbb{E} Y = \int_0^\infty \lambda \phi y^{\lambda} e^{-\phi y^\lambda} \, dy = \phi^{-1/\lambda} \Gamma(1 + 1/\lambda). \] The median of Weibull is \[ y(50) = \phi^{-1/\lambda} (\log 2)^{1/\lambda}. \]

  • We model relationship between \(Y\) and predictors in terms of \(\phi\) \[ \phi = \alpha e^{\mathbf{x}^T \boldsymbol{\beta}}. \] Then the hazard function is \[ h(y; \lambda, \phi) = \lambda \alpha y^{\lambda - 1} e^{\mathbf{x}^T \boldsymbol{\beta}}. \] If \(h_0(y)\) is the baseline hazard function corresponding to reference levels of all predictors, then \[ h(y) = h_0(y) e^{\mathbf{x}^T \boldsymbol{\beta}}, \] which is a proportional hazards model.

Kaplan-Meier estimate

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggfortify)

gehan <- as_tibble(gehan) %>%
  mutate(treat = relevel(treat, ref = "control")) %>%
  print(n = Inf)
## # A tibble: 42 x 4
##     pair  time  cens treat  
##    <int> <int> <int> <fct>  
##  1     1     1     1 control
##  2     1    10     1 6-MP   
##  3     2    22     1 control
##  4     2     7     1 6-MP   
##  5     3     3     1 control
##  6     3    32     0 6-MP   
##  7     4    12     1 control
##  8     4    23     1 6-MP   
##  9     5     8     1 control
## 10     5    22     1 6-MP   
## 11     6    17     1 control
## 12     6     6     1 6-MP   
## 13     7     2     1 control
## 14     7    16     1 6-MP   
## 15     8    11     1 control
## 16     8    34     0 6-MP   
## 17     9     8     1 control
## 18     9    32     0 6-MP   
## 19    10    12     1 control
## 20    10    25     0 6-MP   
## 21    11     2     1 control
## 22    11    11     0 6-MP   
## 23    12     5     1 control
## 24    12    20     0 6-MP   
## 25    13     4     1 control
## 26    13    19     0 6-MP   
## 27    14    15     1 control
## 28    14     6     1 6-MP   
## 29    15     8     1 control
## 30    15    17     0 6-MP   
## 31    16    23     1 control
## 32    16    35     0 6-MP   
## 33    17     5     1 control
## 34    17     6     1 6-MP   
## 35    18    11     1 control
## 36    18    13     1 6-MP   
## 37    19     4     1 control
## 38    19     9     0 6-MP   
## 39    20     1     1 control
## 40    20     6     0 6-MP   
## 41    21     8     1 control
## 42    21    10     0 6-MP
summary(gehan)
##       pair         time            cens            treat   
##  Min.   : 1   Min.   : 1.00   Min.   :0.0000   control:21  
##  1st Qu.: 6   1st Qu.: 6.00   1st Qu.:0.0000   6-MP   :21  
##  Median :11   Median :10.50   Median :1.0000               
##  Mean   :11   Mean   :12.88   Mean   :0.7143               
##  3rd Qu.:16   3rd Qu.:18.50   3rd Qu.:1.0000               
##  Max.   :21   Max.   :35.00   Max.   :1.0000
gehan %>%
  #filter(treat == "control") %>%
  ggplot() + 
  geom_dotplot(mapping = aes(x = time, fill = factor(cens)), binwidth = 1) + 
  facet_wrap(~ treat)

library(survival)

kmfit <- survfit(Surv(time, cens) ~ treat, data = gehan)
summary(kmfit)
## Call: survfit(formula = Surv(time, cens) ~ treat, data = gehan)
## 
##                 treat=control 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1     21       2   0.9048  0.0641      0.78754        1.000
##     2     19       2   0.8095  0.0857      0.65785        0.996
##     3     17       1   0.7619  0.0929      0.59988        0.968
##     4     16       2   0.6667  0.1029      0.49268        0.902
##     5     14       2   0.5714  0.1080      0.39455        0.828
##     8     12       4   0.3810  0.1060      0.22085        0.657
##    11      8       2   0.2857  0.0986      0.14529        0.562
##    12      6       2   0.1905  0.0857      0.07887        0.460
##    15      4       1   0.1429  0.0764      0.05011        0.407
##    17      3       1   0.0952  0.0641      0.02549        0.356
##    22      2       1   0.0476  0.0465      0.00703        0.322
##    23      1       1   0.0000     NaN           NA           NA
## 
##                 treat=6-MP 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     6     21       3    0.857  0.0764        0.720        1.000
##     7     17       1    0.807  0.0869        0.653        0.996
##    10     15       1    0.753  0.0963        0.586        0.968
##    13     12       1    0.690  0.1068        0.510        0.935
##    16     11       1    0.627  0.1141        0.439        0.896
##    22      7       1    0.538  0.1282        0.337        0.858
##    23      6       1    0.448  0.1346        0.249        0.807
autoplot(kmfit)

Estimation

Exponential model

  • The log-likelihood for the exponential model is \[ \ell(\boldsymbol{\theta}) = \sum_j [\delta_j \log h(y_j) + \log S(y_j)] = \sum_j (\delta_j \log \theta_j - \theta_j y_j). \] The right hand side resembles the log-likelihood if treating \(\delta_j\) as Poisson\((\mu_j)\) where \(\mu_j = \theta_j y_j\). Only the survival times \(y_j\) are included in the model as an offset term \(\log y_j\).

  • Proportional hazards model is modeled with \(\theta = e^{\mathbf{x}^T \boldsymbol{\beta}}\). We have \[ \ell(\boldsymbol{\beta}) = \sum_j (\delta_j \cdot \mathbf{x}_j^T \boldsymbol{\beta} - y_j e^{\mathbf{x}_j^T \boldsymbol{\beta}}) = \sum_j \left\{ \delta_j \cdot [\mathbf{x}_j^T \boldsymbol{\beta} + \log(y_j)] - y_j e^{\mathbf{x}_j^T \boldsymbol{\beta}} - \delta_j \log(y_j) \right\}. \]

glm(cens ~ treat + offset(log(time)), family = poisson, data = gehan) %>%
  summary()
## 
## Call:
## glm(formula = cens ~ treat + offset(log(time)), family = poisson, 
##     data = gehan)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.32472  -0.75468   0.07899   0.87500   1.59679  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.1595     0.2182  -9.896  < 2e-16 ***
## treat6-MP    -1.5266     0.3984  -3.832 0.000127 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 54.503  on 41  degrees of freedom
## Residual deviance: 38.017  on 40  degrees of freedom
## AIC: 102.02
## 
## Number of Fisher Scoring iterations: 6
  • Check against the standard software in survival package. The coefficients have opposite sign because survreg links to the mean of exponential which is \(\theta^{-1}\) according to the accelerated failure time (AFT) model.
emod <- survreg(Surv(time, cens) ~ treat, dist = "exponential", data = gehan)
summary(emod)
## 
## Call:
## survreg(formula = Surv(time, cens) ~ treat, data = gehan, dist = "exponential")
##             Value Std. Error    z       p
## (Intercept) 2.159      0.218 9.90 < 2e-16
## treat6-MP   1.527      0.398 3.83 0.00013
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= -108.5   Loglik(intercept only)= -116.8
##  Chisq= 16.49 on 1 degrees of freedom, p= 4.9e-05 
## Number of Newton-Raphson Iterations: 4 
## n= 42

Weibull model

  • With link function \(\phi = \alpha e^{\mathbf{x}^T \boldsymbol{\beta}}\), the log-likelihood of Weibull model is \[\begin{eqnarray*} \ell(\boldsymbol{\beta}, \lambda) &=& \sum_j [\delta_j \log h(y_j) + \log S(y_j)] \\ &=& \sum_{j=1}^n [\delta_j \log (\lambda \alpha y_j^{\lambda - 1} e^{\mathbf{x}_j^T \boldsymbol{\beta}}) - \alpha y_j^{\lambda} e^{\mathbf{x}_j^T \boldsymbol{\beta}}] \\ &=& \sum_{j=1}^n [\delta_j [\log (\alpha y_j^{\lambda} e^{\mathbf{x}^T \boldsymbol{\beta}}) + \log \lambda - \log y_j] - \alpha y_j^{\lambda} e^{\mathbf{x}_j^T \boldsymbol{\beta}}] \\ &=& \sum_{j=1}^n \{\delta_j [\log \alpha + \lambda \log y_j + \mathbf{x}_j^T \boldsymbol{\beta}] - \alpha y_j^{\lambda} e^{\mathbf{x}_j^T \boldsymbol{\beta}} + \delta_j(\log \lambda - \log y_j)\}. \end{eqnarray*}\] Again this is almost the log-likelihood if we pretend \(\delta_j\) are independent Poisson\((\mu_j)\) where \(\mu_j = \alpha y_j^{\lambda} e^{\mathbf{x}^T \boldsymbol{\beta}}\) and include \(\lambda \log y_j\) as offset.

  • Setting the derivative wrt \(\lambda\) to 0 \[ \frac{\partial \ell}{\partial \lambda} = \sum_{j=1}^n \frac{\delta_j}{\lambda} + \delta_j \log y_j - (\log y_j) (\alpha y_j^{\lambda} e^{\mathbf{x}^T \boldsymbol{\beta}}) = 0 \] suggests an update \[ \widehat{\lambda} = \frac{\sum_j \delta_j}{\sum_{j=1}^n (\mu_j - \delta_j) \log y_j}. \] This suggests an estimation procedure that alternately updates \(\lambda\) by above equation and then updates \(\boldsymbol{\beta}\) by a Poisson regression with offset \(\lambda \log (y_i)\). The damping update recommended by Aitkin and Clayton (1980) improves the convergence \[ \lambda^{(k)} = \frac{\lambda^{(k-1)} + \widehat{\lambda}}{2}. \]

  • Let’s manually implement the estimation procedure, starting from the exponential model with \(\lambda = 1\).

sumdelta <- sum(gehan$cens)
lambda <- 1
for (iter in 1:10) {
  pmod <- glm(cens ~ treat + offset(lambda * log(time)), family = poisson, data = gehan)
  cat(iter, "lambda=", lambda, "\t", "beta(treatment)=", coef(pmod)[2], "\n")
  mu <- predict(pmod, type = "response")
  lambda_new <- sumdelta / sum((mu - gehan$cens) * log(gehan$time))
  lambda = (lambda + lambda_new) / 2
}
## 1 lambda= 1   beta(treatment)= -1.526614 
## 2 lambda= 1.315794    beta(treatment)= -1.703675 
## 3 lambda= 1.35468     beta(treatment)= -1.724859 
## 4 lambda= 1.363214    beta(treatment)= -1.729492 
## 5 lambda= 1.365169    beta(treatment)= -1.730553 
## 6 lambda= 1.365621    beta(treatment)= -1.730798 
## 7 lambda= 1.365726    beta(treatment)= -1.730855 
## 8 lambda= 1.36575     beta(treatment)= -1.730868 
## 9 lambda= 1.365756    beta(treatment)= -1.730871 
## 10 lambda= 1.365757   beta(treatment)= -1.730872
summary(pmod)
## 
## Call:
## glm(formula = cens ~ treat + offset(lambda * log(time)), family = poisson, 
##     data = gehan)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5148  -0.8408   0.2221   1.0481   2.0577  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.0707     0.2182 -14.072  < 2e-16 ***
## treat6-MP    -1.7309     0.3984  -4.344  1.4e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 74.045  on 41  degrees of freedom
## Residual deviance: 52.831  on 40  degrees of freedom
## AIC: 116.83
## 
## Number of Fisher Scoring iterations: 6

Note the inference (standard errors) is for the Poisson model. Correct inference can be obtained from the information matrix of \((\lambda, \boldsymbol{\beta})\). Exercise: interpret the coefficient \(\beta_{\text{treatment}} = -1.73\).

  • The survival function in R fits a AFT model. The coefficients from an AFT model can be translated to PH model. Details are omitted here (learn that in a surival class.)
wmod <- survreg(Surv(time, cens) ~ treat, dist = "weibull", data = gehan)
summary(wmod)
## 
## Call:
## survreg(formula = Surv(time, cens) ~ treat, data = gehan, dist = "weibull")
##              Value Std. Error     z       p
## (Intercept)  2.248      0.166 13.55 < 2e-16
## treat6-MP    1.267      0.311  4.08 4.5e-05
## Log(scale)  -0.312      0.147 -2.12   0.034
## 
## Scale= 0.732 
## 
## Weibull distribution
## Loglik(model)= -106.6   Loglik(intercept only)= -116.4
##  Chisq= 19.65 on 1 degrees of freedom, p= 9.3e-06 
## Number of Newton-Raphson Iterations: 5 
## n= 42

Model checking

Check probability model

  • For exponential model, \[ S(y; \theta) = e^{-\theta y}. \] So plot of \(- \log[\widehat{S}(y)]\) against y should be approximately linear.
tibble(surv = kmfit$surv,
       time = kmfit$time) %>%
  ggplot() + 
  geom_point(mapping = aes(x = time, y = - log(surv))) + 
  labs(x = "Event time", y = "-log(Surv. Prob.)")

  • For Weibull model, \[ S(y; \lambda, \phi) = e^{-\phi y^\lambda}. \] So we expect the plot of \(\log [- \log \widehat{S}(y)]\) against \(\log y\) to be linear.
tibble(surv = kmfit$surv,
       time = kmfit$time) %>%
  ggplot() + 
  geom_point(mapping = aes(x = log(time), y = log(- log(surv)))) + 
  labs(x = "log(Event Time)", y = "log[-log(Surv. Prob.)]")

  • If there is some curvi-linear relationship, some other probability models e.g. log-logistic may be better.

Check proportional hazard assumption

  • For proportional hazards model \[\begin{eqnarray*} h(y) = h_0(y) e^{\mathbf{x}^T \boldsymbol{\beta}}. \end{eqnarray*}\] Consider a binary predictor \(x_k \in \{0,1\}\), then \[ \log H_1 = \log H_0 + \beta_k. \] So the \(\log [- \log \widehat{S}(y)]\) against \(\log y\) plot stratified according to \(x_k\) should be parallel with distance \(\beta_k\).

Residual plots

  • Cox-Snell residuals. For uncensored survival time of subject \(j\), \[ r_{C_j} = \widehat{H}_j(Y_j) = - \log [\widehat{S}_j(y_j)]. \] For censored observations, \(r'_{C_j} = r_{C_j} + \Delta\), where \(\Delta = 1\) or \(\log 2\). Distribution of \(r'_{C_j}\) is compared to exponential with unit mean.

  • Martingale residuals \[ r_{M_j} = \delta_j - r_{C_j}. \] These residuals have expected value of 0 but a negatively skewed distribution.

  • Deviance residuals. \[ r_{D_j} = \text{sign}(r_{M_j}) \{ -2[r_{M_j} + \delta_j \log (r_{C_j})] \}^{1/2}. \] Deviance residuals are approximately symmetrically distributed about 0.