Display system information and load tidyverse.
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.2 magrittr_1.5 tools_3.6.2 htmltools_0.4.0
## [5] yaml_2.2.1 Rcpp_1.0.4.6 stringi_1.4.6 rmarkdown_2.1
## [9] knitr_1.28 stringr_1.4.0 xfun_0.13 digest_0.6.25
## [13] rlang_0.4.5 evaluate_0.14
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.4
## ✓ tibble 3.0.0 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
This lecture note draws heavily on Chapter 10 of the textbook An Introduction to Generalized Linear Models by Dobson and Barnett (2nd or 3rd edition). UCLA library link to the 2nd edition.
A survival model studies the time from a well-defined starting point until some event, called "failure", occurs. In other words, survival models study time-to-event outcomes.
Survival outcomes, or survival times, have at least two important features:
the times are non-negative and typically have skewed distributions with long tails.
censoring: some subjects may survive beyond the study period so their actual failure times are unknown.
In the following diagram, subject 3 is right-censored, subject 4 is right-censored due to loss to follow-up, and subject 5 is left-censored.
Let \(Y=\) survival time, and \(f(y)\) be its density. The probability of failure before a specific time \(y\) is given by the cumulative distribution function (cdf) \[ F(y) = \mathbb{P}(Y \le y) = \int_0^y f(t) \, dt. \]
The survivor function is the probability of survival beyond time \(y\) \[ S(y) = \mathbb{P}(Y > y) = 1 - F(y). \]
The hazard function is the probability of death in an infinitesimally small interval \((y, y + \Delta y]\), given survival up to time \(y\): \[\begin{eqnarray*} h(y) &=& \lim_{\Delta y \to 0} \frac{\mathbb{P}(y < Y \le y + \Delta y \mid Y > y)}{\Delta y} \\ &=& \lim_{\Delta y \to 0} \frac{F(y + \Delta y) - F(y)}{\Delta y} \times \frac{1}{S(y)} \\ &=& \frac{f(y)}{S(y)}. \end{eqnarray*}\]
Since \(f(y) = -\frac{\operatorname{d}}{\operatorname{d} y} S(y)\), we can rewrite the hazard function as \[ h(y) = - \frac{\operatorname{d}}{\operatorname{d} y} \log S(y). \] Then we have \[ S(y) = e^{-H(y)} = e^{- \int_0^y h(t) \, dt} \] or \[ H(y) = - \log S(y). \] \(H(y)\) is called the cumulative hazard function or the integrated hazard function.
The median survival time, \(y(50)\), is given by the solution of \(F(y) = 1/2\).
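As a quick numerical sketch (using an exponential distribution with rate \(\theta = 2\), the same value used in the plots below), the median can be checked against its closed form \(y(50) = \log 2 / \theta\):
# Sketch: the median survival time solves F(y) = 1/2.
# For an exponential with rate theta = 2, the closed form is y(50) = log(2) / theta.
qexp(0.5, rate = 2)   # numerical solution of F(y) = 1/2
log(2) / 2            # closed form; the two agree (about 0.347)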
y <- seq(0, 2.5, by = 0.01)
theta <- 2
exp_df <- tibble(y = y,
                 f = dexp(y, theta),
                 F = pexp(y, theta),
                 S = 1 - F,
                 h = theta,
                 H = theta * y)
exp_df %>%
ggplot() +
geom_line(mapping = aes(x = y, y = f)) +
labs(x = "y", y = "f(y)", title = "Exponential density")
The cdf of the exponential distribution with rate \(\theta\) is \[ F(y; \theta) = \int_0^y \theta e^{-\theta t} \, dt = 1 - e^{-\theta y}. \]
The survivor function is \[ S(y; \theta) = e^{-\theta y}. \]
exp_df %>%
ggplot() +
geom_line(mapping = aes(x = y, y = S)) +
labs(x = "y", y = "S(y)", title = "Survivor function of exponential")
exp_df %>%
ggplot() +
geom_line(mapping = aes(x = y, y = h)) +
labs(x = "y", y = "h(y)", title = "Hazard function of exponential")
exp_df %>%
ggplot() +
geom_line(mapping = aes(x = y, y = H)) +
labs(x = "y", y = "H(y)", title = "Cumulative hazard function of exponential")
For the exponential distribution, to ensure \(\theta>0\), it is common to use the log link \[ \theta = e^{\mathbf{x}^T \boldsymbol{\beta}}. \] Then the hazard function has the multiplicative form \[ h(y; \boldsymbol{\beta}) = \theta = e^{\mathbf{x}^T \boldsymbol{\beta}}. \] A unit change in \(x_k\) will result in a hazard ratio or relative hazard \[ \frac{h_1(y; \boldsymbol{\beta})}{h_0(y; \boldsymbol{\beta})} = e^{\beta_k}. \]
Any model of the form \[ h_1(y) = h_0(y) e^{\mathbf{x}^T \boldsymbol{\beta}} \] is called a proportional hazards model. \(h_0(y)\), the hazard function corresponding to the reference levels of all the explanatory variables, is called the baseline hazard.
For proportional hazards models, the cumulative hazard function is given by \[ H_1(y) = \int_0^y h_1(t) \, dt = \int_0^y h_0(t) e^{\mathbf{x}^T \boldsymbol{\beta}} \, dt = H_0(y) e^{\mathbf{x}^T \boldsymbol{\beta}}, \] so \[ \log H_1(y) = \log H_0(y) + \mathbf{x}^T \boldsymbol{\beta}. \]
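As a small illustration of this parallel-curves property (an exponential baseline hazard with rate 2 and a hypothetical coefficient \(\beta = 0.8\) are assumed purely for the sketch):
# Sketch: under proportional hazards, log H1(y) = log H0(y) + x^T beta,
# so log cumulative hazard curves for different covariate values are parallel.
# Baseline: exponential hazard with rate 2; beta = 0.8 is a hypothetical coefficient.
tibble(t = seq(0.01, 2.5, by = 0.01)) %>%
  mutate(`baseline (x = 0)` = 2 * t,
         `x = 1`            = 2 * t * exp(0.8)) %>%
  pivot_longer(-t, names_to = "group", values_to = "H") %>%
  ggplot() +
  geom_line(mapping = aes(x = log(t), y = log(H), color = group)) +
  labs(x = "log(y)", y = "log H(y)",
       title = "Parallel log cumulative hazards under PH (sketch)")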
y <- seq(0, 2.5, by = 0.01)
lambda <- 5 # shape
theta <- 1 # scale
phi <- exp(-lambda * log(theta))   # phi = theta^(-lambda)
wb_df <- tibble(y = y,
                f = dweibull(y, lambda, theta),
                F = pweibull(y, lambda, theta),
                S = 1 - F,
                h = lambda * phi * exp((lambda - 1) * log(y)),
                H = phi * exp(lambda * log(y)))
wb_df %>%
ggplot() +
geom_line(mapping = aes(x = y, y = f)) +
labs(x = "y", y = "f(y)", title = "Weibull density")
With the reparameterization \(\theta^{-\lambda} = \phi\), the density is \[ f(y; \lambda, \phi) = \lambda \phi y^{\lambda - 1} e^{-\phi y^{\lambda}}. \] The exponential distribution is the special case of the Weibull with \(\lambda = 1\).
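As a quick check of this special case (a sketch; note that R's dweibull is parameterized by a scale, so scale \(1/\theta\) corresponds to rate \(\theta\), with \(\theta = 2\) chosen arbitrarily here):
# Sketch: Weibull with shape lambda = 1 and scale 1/theta equals exponential with rate theta.
all.equal(dweibull(seq(0, 2.5, by = 0.01), shape = 1, scale = 1 / 2),
          dexp(seq(0, 2.5, by = 0.01), rate = 2))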
The survivor function for Weibull is \[ S(y; \lambda, \phi) = \int_y^\infty \lambda \phi u^{\lambda - 1} e^{-\phi u^{\lambda}} \, du = e^{-\phi y^{\lambda}}. \]
wb_df %>%
ggplot() +
geom_line(mapping = aes(x = y, y = S)) +
labs(x = "y", y = "S(y)", title = "Survivor functiion of Weibull")
The hazard function is \[ h(y; \lambda, \phi) = \frac{f(y; \lambda, \phi)}{S(y; \lambda, \phi)} = \lambda \phi y^{\lambda - 1}. \]
For \(\lambda > 1\), the hazard function increases with \(y\), yielding an increasing Weibull or accelerated failure time model. Such a model might be expected for leukemia patients not responding to treatment, where the event of interest is death: as survival time increases for such a patient, and as the prognosis accordingly worsens, the patient’s potential for dying of the disease also increases.
For \(\lambda < 1\), the hazard function decreases with \(y\), yielding a decreasing Weibull model. Such a model might be expected when the event is death in persons recovering from surgery, because the potential for dying after surgery usually decreases as the time since surgery increases.
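The three regimes can be compared directly; the sketch below uses \(\phi = 1\) and shapes \(\lambda = 0.5, 1, 2\) (values assumed only for illustration):
# Sketch: Weibull hazards h(y) = lambda * phi * y^(lambda - 1) for several shapes (phi = 1).
crossing(t = seq(0.01, 2.5, by = 0.01), shape = c(0.5, 1, 2)) %>%
  mutate(h = shape * t^(shape - 1)) %>%
  ggplot() +
  geom_line(mapping = aes(x = t, y = h, color = factor(shape))) +
  labs(x = "y", y = "h(y)", color = "lambda",
       title = "Weibull hazards: decreasing (0.5), constant (1), increasing (2)")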
wb_df %>%
ggplot() +
geom_line(mapping = aes(x = y, y = h)) +
labs(x = "y", y = "h(y)", title = "Hazard functiion of Weibull")
wb_df %>%
ggplot() +
geom_line(mapping = aes(x = y, y = H)) +
labs(x = "y", y = "f(y)", title = "Cummulative hazard functiion of Weibull")
The mean of Weibull is \[ \mathbb{E} Y = \int_0^\infty \lambda \phi y^{\lambda} e^{-\phi y^\lambda} \, dy = \phi^{-1/\lambda} \Gamma(1 + 1/\lambda). \] The median of Weibull is \[ y(50) = \phi^{-1/\lambda} (\log 2)^{1/\lambda}. \]
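These two formulas can be checked numerically; the sketch below reuses lambda = 5 and phi = 1 as defined above, so the Weibull scale is \(\phi^{-1/\lambda} = 1\):
# Sketch: numerical check of the Weibull mean and median formulas.
c(mean_formula    = phi^(-1 / lambda) * gamma(1 + 1 / lambda),
  mean_numeric    = integrate(function(t) t * dweibull(t, shape = lambda,
                                                       scale = phi^(-1 / lambda)),
                              lower = 0, upper = Inf)$value,
  median_formula  = phi^(-1 / lambda) * log(2)^(1 / lambda),
  median_qweibull = qweibull(0.5, shape = lambda, scale = phi^(-1 / lambda)))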
We model the relationship between \(Y\) and the predictors through \(\phi\): \[ \phi = \alpha e^{\mathbf{x}^T \boldsymbol{\beta}}. \] Then the hazard function is \[ h(y; \lambda, \phi) = \lambda \alpha y^{\lambda - 1} e^{\mathbf{x}^T \boldsymbol{\beta}}. \] If \(h_0(y)\) is the baseline hazard function corresponding to the reference levels of all predictors, then \[ h(y) = h_0(y) e^{\mathbf{x}^T \boldsymbol{\beta}}, \] which is a proportional hazards model.
Arrange the event times \(y_{(1)} \le y_{(2)} \le \cdots \le y_{(k)}\). The Kaplan-Meier formula, or product limit formula, is \[\begin{eqnarray*} \widehat{S}(y_{(k)}) &=& \widehat{\mathbb{P}}(Y > y_{(k)}) \\ &=& \widehat{\mathbb{P}}(Y > y_{(k-1)}) \times \widehat{\mathbb{P}}(Y > y_{(k)} \mid Y > y_{(k-1)}) \\ &=& \widehat{S}(y_{(k-1)}) \times \widehat{\mathbb{P}}(Y > y_{(k)} \mid Y > y_{(k-1)}) \\ &=& \cdots \\ &=& \prod_{j=1}^k \widehat{\mathbb{P}}(Y > y_{(j)} \mid Y > y_{(j-1)}) \\ &=& \prod_{j=1}^k \left( \frac{n_j - d_j}{n_j} \right), \end{eqnarray*}\] where \[\begin{eqnarray*} n_j &=& \text{number of subjects who are alive just before time $y_{(j)}$}, \\ d_j &=& \text{number of deaths that occur at time $y_{(j)}$}. \end{eqnarray*}\]
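The product limit formula can be computed by hand. The sketch below uses a tiny made-up sample (the times and censoring indicators are hypothetical, chosen only to illustrate \(n_j\), \(d_j\), and the running product); survfit() from the survival package, used below, does the same computation for the real data and adds standard errors.
# Sketch: Kaplan-Meier estimate computed by hand on a tiny hypothetical sample.
toy <- tibble(time  = c(2, 3, 3, 5, 7, 8, 8, 10),
              delta = c(1, 1, 0, 1, 0, 1, 1, 0))   # delta = 1: death observed; 0: censored
toy %>%
  filter(delta == 1) %>%            # the estimate drops only at observed event times
  distinct(time) %>%
  arrange(time) %>%
  mutate(n_j   = map_int(time, ~ sum(toy$time >= .x)),                  # at risk just before y_(j)
         d_j   = map_int(time, ~ sum(toy$time == .x & toy$delta == 1)), # deaths at y_(j)
         S_hat = cumprod((n_j - d_j) / n_j))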
In the leukemia data gehan (from the MASS package), time is the remission time in weeks.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggfortify)
gehan <- as_tibble(gehan) %>%
mutate(treat = relevel(treat, ref = "control")) %>%
print(n = Inf)
## # A tibble: 42 x 4
## pair time cens treat
## <int> <int> <int> <fct>
## 1 1 1 1 control
## 2 1 10 1 6-MP
## 3 2 22 1 control
## 4 2 7 1 6-MP
## 5 3 3 1 control
## 6 3 32 0 6-MP
## 7 4 12 1 control
## 8 4 23 1 6-MP
## 9 5 8 1 control
## 10 5 22 1 6-MP
## 11 6 17 1 control
## 12 6 6 1 6-MP
## 13 7 2 1 control
## 14 7 16 1 6-MP
## 15 8 11 1 control
## 16 8 34 0 6-MP
## 17 9 8 1 control
## 18 9 32 0 6-MP
## 19 10 12 1 control
## 20 10 25 0 6-MP
## 21 11 2 1 control
## 22 11 11 0 6-MP
## 23 12 5 1 control
## 24 12 20 0 6-MP
## 25 13 4 1 control
## 26 13 19 0 6-MP
## 27 14 15 1 control
## 28 14 6 1 6-MP
## 29 15 8 1 control
## 30 15 17 0 6-MP
## 31 16 23 1 control
## 32 16 35 0 6-MP
## 33 17 5 1 control
## 34 17 6 1 6-MP
## 35 18 11 1 control
## 36 18 13 1 6-MP
## 37 19 4 1 control
## 38 19 9 0 6-MP
## 39 20 1 1 control
## 40 20 6 0 6-MP
## 41 21 8 1 control
## 42 21 10 0 6-MP
summary(gehan)
## pair time cens treat
## Min. : 1 Min. : 1.00 Min. :0.0000 control:21
## 1st Qu.: 6 1st Qu.: 6.00 1st Qu.:0.0000 6-MP :21
## Median :11 Median :10.50 Median :1.0000
## Mean :11 Mean :12.88 Mean :0.7143
## 3rd Qu.:16 3rd Qu.:18.50 3rd Qu.:1.0000
## Max. :21 Max. :35.00 Max. :1.0000
gehan %>%
#filter(treat == "control") %>%
ggplot() +
geom_dotplot(mapping = aes(x = time, fill = factor(cens)), binwidth = 1) +
facet_wrap(~ treat)
library(survival)
kmfit <- survfit(Surv(time, cens) ~ treat, data = gehan)
summary(kmfit)
## Call: survfit(formula = Surv(time, cens) ~ treat, data = gehan)
##
## treat=control
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1 21 2 0.9048 0.0641 0.78754 1.000
## 2 19 2 0.8095 0.0857 0.65785 0.996
## 3 17 1 0.7619 0.0929 0.59988 0.968
## 4 16 2 0.6667 0.1029 0.49268 0.902
## 5 14 2 0.5714 0.1080 0.39455 0.828
## 8 12 4 0.3810 0.1060 0.22085 0.657
## 11 8 2 0.2857 0.0986 0.14529 0.562
## 12 6 2 0.1905 0.0857 0.07887 0.460
## 15 4 1 0.1429 0.0764 0.05011 0.407
## 17 3 1 0.0952 0.0641 0.02549 0.356
## 22 2 1 0.0476 0.0465 0.00703 0.322
## 23 1 1 0.0000 NaN NA NA
##
## treat=6-MP
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 6 21 3 0.857 0.0764 0.720 1.000
## 7 17 1 0.807 0.0869 0.653 0.996
## 10 15 1 0.753 0.0963 0.586 0.968
## 13 12 1 0.690 0.1068 0.510 0.935
## 16 11 1 0.627 0.1141 0.439 0.896
## 22 7 1 0.538 0.1282 0.337 0.858
## 23 6 1 0.448 0.1346 0.249 0.807
autoplot(kmfit)
Let \(y_1, \ldots, y_r\) be the uncensored observations and \(y_{r+1}, \ldots, y_n\) the censored ones, and let \(\delta_j = 1\) if \(y_j\) is uncensored and \(\delta_j = 0\) otherwise. The likelihood is \[ L = \prod_{j=1}^r f(y_j) \times \prod_{j=r+1}^n S(y_j) = \prod_{j=1}^n f(y_j)^{\delta_j} S(y_j)^{1 - \delta_j}, \] so the log-likelihood is \[ \ell = \sum_{j=1}^n [\delta_j \log f(y_j) + (1 - \delta_j) \log S(y_j)] = \sum_{j=1}^n [\delta_j \log h(y_j) + \log S(y_j)]. \]
Estimation is by maximum likelihood (MLE).
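As a sketch, the censored log-likelihood can also be coded directly and maximized numerically; for an intercept-only exponential model this should reproduce the closed-form MLE \(\widehat{\theta} = \sum_j \delta_j / \sum_j y_j\) (this check uses the gehan data loaded above).
# Sketch: direct maximization of the censored exponential log-likelihood
# l(theta) = sum_j [delta_j * log f(y_j) + (1 - delta_j) * log S(y_j)].
negloglik <- function(log_theta, y, delta) {
  theta <- exp(log_theta)   # optimize on the log scale so theta stays positive
  -sum(delta * dexp(y, rate = theta, log = TRUE) +
       (1 - delta) * pexp(y, rate = theta, lower.tail = FALSE, log.p = TRUE))
}
fit <- optimize(negloglik, interval = c(-10, 10), y = gehan$time, delta = gehan$cens)
exp(fit$minimum)                    # MLE of theta
sum(gehan$cens) / sum(gehan$time)   # closed-form MLE; the two should agree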
The log-likelihood for the exponential model is \[ \ell(\boldsymbol{\theta}) = \sum_j [\delta_j \log h(y_j) + \log S(y_j)] = \sum_j (\delta_j \log \theta_j - \theta_j y_j). \] The right-hand side resembles the log-likelihood obtained by treating \(\delta_j\) as Poisson\((\mu_j)\) with \(\mu_j = \theta_j y_j\). The survival times \(y_j\) then enter the model only through the offset term \(\log y_j\).
The proportional hazards model takes \(\theta_j = e^{\mathbf{x}_j^T \boldsymbol{\beta}}\). We have \[ \ell(\boldsymbol{\beta}) = \sum_j (\delta_j \cdot \mathbf{x}_j^T \boldsymbol{\beta} - y_j e^{\mathbf{x}_j^T \boldsymbol{\beta}}) = \sum_j \left\{ \delta_j \cdot [\mathbf{x}_j^T \boldsymbol{\beta} + \log(y_j)] - y_j e^{\mathbf{x}_j^T \boldsymbol{\beta}} - \delta_j \log(y_j) \right\}. \]
glm(cens ~ treat + offset(log(time)), family = poisson, data = gehan) %>%
summary()
##
## Call:
## glm(formula = cens ~ treat + offset(log(time)), family = poisson,
## data = gehan)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.32472 -0.75468 0.07899 0.87500 1.59679
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1595 0.2182 -9.896 < 2e-16 ***
## treat6-MP -1.5266 0.3984 -3.832 0.000127 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 54.503 on 41 degrees of freedom
## Residual deviance: 38.017 on 40 degrees of freedom
## AIC: 102.02
##
## Number of Fisher Scoring iterations: 6
We can fit the same model with the survreg function in the survival package. The coefficients have the opposite sign because survreg links to the mean of the exponential, which is \(\theta^{-1}\), following the accelerated failure time (AFT) parameterization.
emod <- survreg(Surv(time, cens) ~ treat, dist = "exponential", data = gehan)
summary(emod)
##
## Call:
## survreg(formula = Surv(time, cens) ~ treat, data = gehan, dist = "exponential")
## Value Std. Error z p
## (Intercept) 2.159 0.218 9.90 < 2e-16
## treat6-MP 1.527 0.398 3.83 0.00013
##
## Scale fixed at 1
##
## Exponential distribution
## Loglik(model)= -108.5 Loglik(intercept only)= -116.8
## Chisq= 16.49 on 1 degrees of freedom, p= 4.9e-05
## Number of Newton-Raphson Iterations: 4
## n= 42
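As a quick check of this sign relationship (a sketch), negating the survreg coefficients should recover the log-hazard-scale estimates from the Poisson GLM above:
# Sketch: for the exponential AFT fit, the linear predictor targets log(1/theta),
# so negating the coefficients gives the log-hazard (PH) scale of the Poisson GLM.
-coef(emod)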
With link function \(\phi = \alpha e^{\mathbf{x}^T \boldsymbol{\beta}}\), the log-likelihood of the Weibull model is \[\begin{eqnarray*} \ell(\boldsymbol{\beta}, \lambda) &=& \sum_j [\delta_j \log h(y_j) + \log S(y_j)] \\ &=& \sum_{j=1}^n [\delta_j \log (\lambda \alpha y_j^{\lambda - 1} e^{\mathbf{x}_j^T \boldsymbol{\beta}}) - \alpha y_j^{\lambda} e^{\mathbf{x}_j^T \boldsymbol{\beta}}] \\ &=& \sum_{j=1}^n [\delta_j [\log (\alpha y_j^{\lambda} e^{\mathbf{x}_j^T \boldsymbol{\beta}}) + \log \lambda - \log y_j] - \alpha y_j^{\lambda} e^{\mathbf{x}_j^T \boldsymbol{\beta}}] \\ &=& \sum_{j=1}^n \{\delta_j [\log \alpha + \lambda \log y_j + \mathbf{x}_j^T \boldsymbol{\beta}] - \alpha y_j^{\lambda} e^{\mathbf{x}_j^T \boldsymbol{\beta}} + \delta_j(\log \lambda - \log y_j)\}. \end{eqnarray*}\] Again this is almost the log-likelihood if we pretend \(\delta_j\) are independent Poisson\((\mu_j)\) where \(\mu_j = \alpha y_j^{\lambda} e^{\mathbf{x}_j^T \boldsymbol{\beta}}\) and include \(\lambda \log y_j\) as an offset.
Setting the derivative with respect to \(\lambda\) to 0, \[ \frac{\partial \ell}{\partial \lambda} = \sum_{j=1}^n \left[ \frac{\delta_j}{\lambda} + \delta_j \log y_j - (\log y_j) (\alpha y_j^{\lambda} e^{\mathbf{x}_j^T \boldsymbol{\beta}}) \right] = 0, \] suggests the update \[ \widehat{\lambda} = \frac{\sum_j \delta_j}{\sum_{j=1}^n (\mu_j - \delta_j) \log y_j}. \] This suggests an estimation procedure that alternately updates \(\lambda\) by the equation above and then updates \(\boldsymbol{\beta}\) by a Poisson regression with offset \(\lambda \log y_j\). The damped update recommended by Aitkin and Clayton (1980) improves convergence: \[ \lambda^{(k)} = \frac{\lambda^{(k-1)} + \widehat{\lambda}}{2}. \]
Let’s manually implement the estimation procedure, starting from the exponential model with \(\lambda = 1\).
sumdelta <- sum(gehan$cens)
lambda <- 1
for (iter in 1:10) {
  # update beta by Poisson regression with offset lambda * log(time)
  pmod <- glm(cens ~ treat + offset(lambda * log(time)), family = poisson, data = gehan)
  cat(iter, "lambda=", lambda, "\t", "beta(treatment)=", coef(pmod)[2], "\n")
  mu <- predict(pmod, type = "response")
  # update lambda, with damping
  lambda_new <- sumdelta / sum((mu - gehan$cens) * log(gehan$time))
  lambda <- (lambda + lambda_new) / 2
}
## 1 lambda= 1 beta(treatment)= -1.526614
## 2 lambda= 1.315794 beta(treatment)= -1.703675
## 3 lambda= 1.35468 beta(treatment)= -1.724859
## 4 lambda= 1.363214 beta(treatment)= -1.729492
## 5 lambda= 1.365169 beta(treatment)= -1.730553
## 6 lambda= 1.365621 beta(treatment)= -1.730798
## 7 lambda= 1.365726 beta(treatment)= -1.730855
## 8 lambda= 1.36575 beta(treatment)= -1.730868
## 9 lambda= 1.365756 beta(treatment)= -1.730871
## 10 lambda= 1.365757 beta(treatment)= -1.730872
summary(pmod)
##
## Call:
## glm(formula = cens ~ treat + offset(lambda * log(time)), family = poisson,
## data = gehan)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5148 -0.8408 0.2221 1.0481 2.0577
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.0707 0.2182 -14.072 < 2e-16 ***
## treat6-MP -1.7309 0.3984 -4.344 1.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 74.045 on 41 degrees of freedom
## Residual deviance: 52.831 on 40 degrees of freedom
## AIC: 116.83
##
## Number of Fisher Scoring iterations: 6
Note that the reported inference (standard errors) is for the working Poisson model with \(\lambda\) treated as fixed; correct inference can be obtained from the information matrix of \((\lambda, \boldsymbol{\beta})\). Exercise: interpret the coefficient \(\beta_{\text{treatment}} = -1.73\).
For the Weibull model, the survreg function in the survival package fits an AFT model. The coefficients from an AFT model can be translated to those of a PH model; details are omitted here (you will learn them in a survival class).
wmod <- survreg(Surv(time, cens) ~ treat, dist = "weibull", data = gehan)
summary(wmod)
##
## Call:
## survreg(formula = Surv(time, cens) ~ treat, data = gehan, dist = "weibull")
## Value Std. Error z p
## (Intercept) 2.248 0.166 13.55 < 2e-16
## treat6-MP 1.267 0.311 4.08 4.5e-05
## Log(scale) -0.312 0.147 -2.12 0.034
##
## Scale= 0.732
##
## Weibull distribution
## Loglik(model)= -106.6 Loglik(intercept only)= -116.4
## Chisq= 19.65 on 1 degrees of freedom, p= 9.3e-06
## Number of Newton-Raphson Iterations: 5
## n= 42
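Although the translation is left to a survival course, a commonly used Weibull relation (stated here without derivation) is \(\boldsymbol{\beta}_{\text{PH}} = -\boldsymbol{\beta}_{\text{AFT}} / \text{scale}\) with \(\lambda = 1/\text{scale}\); as a sketch, this should roughly reproduce the estimates from the Poisson iteration above.
# Sketch: translate the Weibull AFT estimates to the proportional hazards scale.
-coef(wmod) / wmod$scale   # compare with beta(treatment) = -1.73 from the iteration above
1 / wmod$scale             # compare with lambda = 1.366 from the iteration above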
As a model check, plot the estimated cumulative hazard \(-\log \widehat{S}(y)\) from the Kaplan-Meier fit against the event times; under an exponential model, \(H(y) = \theta y\), so the points should fall roughly on a straight line through the origin.
tibble(surv = kmfit$surv,
       time = kmfit$time) %>%
  ggplot() +
  geom_point(mapping = aes(x = time, y = - log(surv))) +
  labs(x = "Event time", y = "-log(Surv. Prob.)")
Similarly, under a Weibull model, \(\log H(y) = \log \phi + \lambda \log y\), so plotting \(\log[-\log \widehat{S}(y)]\) against \(\log y\) should give roughly a straight line.
tibble(surv = kmfit$surv,
       time = kmfit$time) %>%
  ggplot() +
  geom_point(mapping = aes(x = log(time), y = log(- log(surv)))) +
  labs(x = "log(Event Time)", y = "log[-log(Surv. Prob.)]")
Cox-Snell residuals. For the uncensored survival time of subject \(j\), \[ r_{C_j} = \widehat{H}_j(y_j) = - \log [\widehat{S}_j(y_j)]. \] For censored observations, \(r'_{C_j} = r_{C_j} + \Delta\), where \(\Delta = 1\) or \(\log 2\). The distribution of the \(r'_{C_j}\) is compared to the exponential distribution with unit mean.
Martingale residuals: \[ r_{M_j} = \delta_j - r_{C_j}. \] These residuals have an expected value of 0 but a negatively skewed distribution.
Deviance residuals. \[ r_{D_j} = \text{sign}(r_{M_j}) \{ -2[r_{M_j} + \delta_j \log (r_{C_j})] \}^{1/2}. \] Deviance residuals are approximately symmetrically distributed about 0.
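A sketch of how these residuals could be computed for the exponential PH fit of the gehan data: refit the earlier Poisson GLM (keeping the object this time); its fitted values \(\widehat{\mu}_j = \widehat{\theta}_j y_j\) are exactly the Cox-Snell residuals \(\widehat{H}_j(y_j)\).
# Sketch: Cox-Snell, martingale, and deviance residuals for the exponential PH model.
expmod <- glm(cens ~ treat + offset(log(time)), family = poisson, data = gehan)
gehan %>%
  mutate(r_C = fitted(expmod),   # Cox-Snell residual: H_hat_j(y_j) = theta_hat_j * y_j
         r_M = cens - r_C,       # martingale residual
         r_D = sign(r_M) * sqrt(-2 * (r_M + cens * log(r_C)))) %>%   # deviance residual
  dplyr::select(time, cens, treat, r_C, r_M, r_D) %>%   # dplyr:: because MASS masks select
  head()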