Display system information and load the tidyverse and faraway packages.
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.2 magrittr_1.5 tools_3.6.2 htmltools_0.4.0
## [5] yaml_2.2.1 Rcpp_1.0.4.6 stringi_1.4.6 rmarkdown_2.1
## [9] knitr_1.28 stringr_1.4.0 xfun_0.13 digest_0.6.25
## [13] rlang_0.4.5 evaluate_0.14
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.4
## ✓ tibble 3.0.0 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(faraway)
The faraway package contains the datasets in the ELMR book.
Image source: https://www.gep.com/mind/blog/outlook-for-the-global-semiconductor-silicon-wafer-industry
(wafer <- tibble(
y = c(320, 14, 80, 36),
particle = gl(2, 1, length = 4, labels = c("no", "yes")),
quality = gl(2, 2, labels = c("good", "bad"))))
## # A tibble: 4 x 3
## y particle quality
## <dbl> <fct> <fct>
## 1 320 no good
## 2 14 yes good
## 3 80 no bad
## 4 36 yes bad
This kind of data is conveniently presented as a \(2 \times 2\) contingency table.
(ytable <- xtabs(y ~ particle + quality, data = wafer))
## quality
## particle good bad
## no 320 80
## yes 14 36
summary(ytable)
## Call: xtabs(formula = y ~ particle + quality, data = wafer)
## Number of cases in table: 450
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 62.81, df = 1, p-value = 2.274e-15
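The same Pearson \(X^2\) statistic can be obtained from chisq.test directly (a quick cross-check, not in the original notes; the continuity correction is disabled to match the xtabs summary):
chisq.test(ytable, correct = FALSE)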
library(gtsummary)
glm(y ~ particle + quality, family = poisson, data = wafer) %>%
#tbl_regression()
summary()
##
## Call:
## glm(formula = y ~ particle + quality, family = poisson, data = wafer)
##
## Deviance Residuals:
## 1 2 3 4
## 1.324 -4.350 -2.370 5.266
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.6934 0.0572 99.535 <2e-16 ***
## particleyes -2.0794 0.1500 -13.863 <2e-16 ***
## qualitybad -1.0575 0.1078 -9.813 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 474.10 on 3 degrees of freedom
## Residual deviance: 54.03 on 1 degrees of freedom
## AIC: 83.774
##
## Number of Fisher Scoring iterations: 5
The null (intercept-only) model assumes that the rate is constant across particle and quality levels. The analysis of deviance compares the deviance difference \(474.10 - 54.03 = 420.07\) to a \(\chi^2_2\) distribution, indicating the null model is rejected in favor of our two-predictor Poisson model.
If we add an interaction term particle * quality, the model costs 4 parameters, the same number as the observations, so it is equivalent to the full/saturated model. The analysis of deviance compares the residual deviance 54.03 to a \(\chi^2_1\) distribution, indicating the interaction term is highly significant.
Therefore we conclude that the presence of a particle on the die significantly affects the quality of the wafer.
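As a quick check, the two p-values can be computed directly from the reported deviances (a minimal sketch):
# null vs two-predictor model: deviance difference on 2 df
pchisq(474.10 - 54.03, df = 2, lower.tail = FALSE)
# two-predictor vs saturated model: residual deviance on 1 df
pchisq(54.03, df = 1, lower.tail = FALSE)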
Now assume the total sample size is fixed at \(n=450\). Then we can model the counts by a multinomial model. We consider two models:
\(H_1\): the full/saturated model, where each cell of the \(2 \times 2\) table has its own probability \(p_{ij}\), and
\(H_0\): a reduced model that assumes particle is independent of quality, thus \(p_{ij} = p_i p_j\).
For the full/saturated model \(H_1\), the multinomial likelihood is \[ \frac{n!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j p_{ij}^{y_{ij}}. \] We estimate \(p_{ij}\) by the cell proportion (MLE) \[ \widehat{p}_{ij} = \frac{y_{ij}}{n}. \]
For the reduced model assuming independence \(H_0\), the multinomial likelihood is \[ \frac{n!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j (p_i p_j)^{y_{ij}} = \frac{n!}{\prod_i \prod_j y_{ij}!} p_i^{\sum_j y_{ij}} p_j^{\sum_i y_{ij}}. \] The MLE is \[ \widehat{p}_i = \frac{\sum_j y_{ij}}{n}, \quad \widehat{p}_j = \frac{\sum_i y_{ij}}{n}. \]
The analysis of deviance compares the deviance \[\begin{eqnarray*} D &=& 2 \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{n} - 2 \sum_i \left(\sum_j y_{ij}\right) \log \widehat{p}_i - 2 \sum_j \left(\sum_i y_{ij}\right) \log \widehat{p}_j \\ &=& 2 \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{n \widehat{p}_i \widehat{p}_j} \\ &=& 2 \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{\widehat{\mu}_{ij}} \end{eqnarray*}\] to 1 degree of freedom.
(partp <- xtabs(y ~ particle, data = wafer) %>% prop.table())
## particle
## no yes
## 0.8888889 0.1111111
(qualp <- xtabs(y ~ quality, data = wafer) %>% prop.table())
## quality
## good bad
## 0.7422222 0.2577778
muhat <- 450 * outer(partp, qualp)
ytable <- xtabs(y ~ particle + quality, data = wafer)
2 * sum(ytable * log(ytable / muhat))
## [1] 54.03045
We get the exact same result as the analysis of deviance in the Poisson model.
This connection between Poisson and multinomial is no surprise due to the following fact. If \(Y_1, \ldots, Y_k\) are independent Poisson with means \(\lambda_1, \ldots, \lambda_k\), then the joint distribution of \(Y_1, \ldots, Y_k \mid \sum_i Y_i = n\) is multinomial with probabilities \(p_j = \lambda_j / \sum_i \lambda_i\).
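A small simulation, added here as a sketch, illustrates this fact:
# condition independent Poisson draws on their total; the cell proportions
# should approximate lambda / sum(lambda) = (0.2, 0.3, 0.5)
set.seed(123)
lambda <- c(2, 3, 5)
ysim <- matrix(rpois(3e5, lambda), ncol = 3, byrow = TRUE)
keep <- rowSums(ysim) == 10   # condition on the total being 10
colMeans(ysim[keep, ]) / 10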
If we view particle as a predictor affecting whether a wafer is of good or bad quality, we end up with an independent binomial model. The null (intercept-only) model corresponds to the hypothesis that particle does not affect quality.
tibble(good = c(320, 14),
bad = c(80, 36),
particle = c("no", "yes")) %>%
print() %>%
glm(cbind(good, bad) ~ 1, family = binomial, data = .) %>%
summary()
## # A tibble: 2 x 3
## good bad particle
## <dbl> <dbl> <chr>
## 1 320 80 no
## 2 14 36 yes
##
## Call:
## glm(formula = cbind(good, bad) ~ 1, family = binomial, data = .)
##
## Deviance Residuals:
## 1 2
## 2.715 -6.831
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.0576 0.1078 9.813 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 54.03 on 1 degrees of freedom
## Residual deviance: 54.03 on 1 degrees of freedom
## AIC: 66.191
##
## Number of Fisher Scoring iterations: 3
The alternative model corresponds to the full/saturated model where particle is included as a predictor.
Again we observe exactly the same analysis of deviance inference.
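As a sketch (not in the original, using a small helper data frame wafer2), fitting particle as a predictor yields the saturated model, whose residual deviance is 0 while the null deviance remains 54.03:
# saturated binomial fit with particle as a predictor
wafer2 <- tibble(good = c(320, 14), bad = c(80, 36), particle = c("no", "yes"))
glm(cbind(good, bad) ~ particle, family = binomial, data = wafer2) %>%
  summary()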
When there are more than two rows or columns, this model is called the product binomial/multinomial model.
Finally if we fix both row and column marginal totals, the probability of the observed table is \[\begin{eqnarray*} p(y_{11}, y_{12}, y_{21}, y_{22}) &=& \frac{\binom{y_{1\cdot}}{y_{11}} \binom{y_{2\cdot}}{y_{22}} \binom{y_{\cdot 1}}{y_{21}} \binom{y_{\cdot 2}}{y_{12}}}{\binom{n}{y_{11} \, y_{12} \, y_{21} \, y_{22}}} \\ &=& \frac{y_{1\cdot}! y_{2\cdot}! y_{\cdot 1}! y_{\cdot 2}!}{y_{11}! y_{12}! y_{21}! y_{22}! n!} \\ &=& \frac{\binom{y_{1 \cdot}}{y_{11}} \binom{y_{2 \cdot}}{y_{\cdot 1} - y_{11}}}{\binom{n}{y_{\cdot 1}}}. \end{eqnarray*}\]
Under the null hypothesis that particle is not associated with quality, Fisher’s exact test calculates the p-value by summing the probabilities of tables with more extreme observations \[
\sum_{y_{11} \ge 320} p(y_{11}, y_{12}, y_{21}, y_{22}).
\]
fisher.test(ytable)
##
## Fisher's Exact Test for Count Data
##
## data: ytable
## p-value = 2.955e-13
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 5.090628 21.544071
## sample estimates:
## odds ratio
## 10.21331
The sample odds ratio is \[ \frac{\pi_{\text{no particle}} / (1 - \pi_{\text{no particle}})}{\pi_{\text{particle}} / (1 - \pi_{\text{particle}})} = \frac{y_{11} y_{22}}{y_{12} y_{21}}. \] (Note that fisher.test reports the conditional MLE of the odds ratio, 10.21, which differs slightly from the sample value.)
(320 * 36) / (14 * 80)
## [1] 10.28571
The haireye data set contains data on 592 statistics students cross-classified by hair and eye color.
haireye <- as_tibble(haireye) %>%
print(n = Inf)
## # A tibble: 16 x 3
## y eye hair
## <dbl> <fct> <fct>
## 1 5 green BLACK
## 2 29 green BROWN
## 3 14 green RED
## 4 16 green BLOND
## 5 15 hazel BLACK
## 6 54 hazel BROWN
## 7 14 hazel RED
## 8 10 hazel BLOND
## 9 20 blue BLACK
## 10 84 blue BROWN
## 11 17 blue RED
## 12 94 blue BLOND
## 13 68 brown BLACK
## 14 119 brown BROWN
## 15 26 brown RED
## 16 7 brown BLOND
(haireye_table <- xtabs(y ~ hair + eye, data = haireye))
## eye
## hair green hazel blue brown
## BLACK 5 15 20 68
## BROWN 29 54 84 119
## RED 14 14 17 26
## BLOND 16 10 94 7
If eye and hair are independent, we expect the mosaic plot to show a regular grid.
mosaicplot(haireye_table, color = TRUE, main = NULL, las = 1)
summary(haireye_table)
## Call: xtabs(formula = y ~ hair + eye, data = haireye)
## Number of cases in table: 592
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 138.29, df = 9, p-value = 2.325e-25
We fit a Poisson GLM:
modc <- glm(y ~ hair + eye, family = poisson, data = haireye)
summary(modc)
##
## Call:
## glm(formula = y ~ hair + eye, family = poisson, data = haireye)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7.326 -2.065 -0.212 1.235 6.172
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.4575 0.1523 16.136 < 2e-16 ***
## hairBROWN 0.9739 0.1129 8.623 < 2e-16 ***
## hairRED -0.4195 0.1528 -2.745 0.00604 **
## hairBLOND 0.1621 0.1309 1.238 0.21569
## eyehazel 0.3737 0.1624 2.301 0.02139 *
## eyeblue 1.2118 0.1424 8.510 < 2e-16 ***
## eyebrown 1.2347 0.1420 8.694 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 453.31 on 15 degrees of freedom
## Residual deviance: 146.44 on 9 degrees of freedom
## AIC: 241.04
##
## Number of Fisher Scoring iterations: 5
which clearly shows a lack of fit. The interaction model is equivalent to the full/saturated model. Therefore we see strong evidence of dependence between eye and hair.
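The lack-of-fit p-value can be computed from the residual deviance reported above (a quick added check):
# residual deviance 146.44 on 9 df
pchisq(146.44, df = 9, lower.tail = FALSE)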
We follow up with a correspondence analysis to study how eye and hair are dependent.
(R <- xtabs(residuals(modc, type = "pearson") ~ hair + eye, data = haireye))
## eye
## hair green hazel blue brown
## BLACK -1.95368354 -0.47735203 -3.06937747 4.39839852
## BROWN -0.34509961 1.35328398 -1.94947682 1.23345810
## RED 2.28273672 0.85225273 -1.73012546 -0.07497794
## BLOND 0.61269815 -2.22784430 7.04959022 -5.85099741
(svdr <- svd(R, nu = 2, nv = 2))
## $d
## [1] 1.111726e+01 3.627417e+00 1.240273e+00 1.678879e-10
##
## $u
## [,1] [,2]
## [1,] -0.47166009 0.6154461
## [2,] -0.22552151 -0.1522951
## [3,] -0.09817011 -0.7424993
## [4,] 0.84678181 0.2161646
##
## $v
## [,1] [,2]
## [1,] 0.1163980 -0.7477267
## [2,] -0.1844169 -0.4450167
## [3,] 0.7220003 0.3353208
## [4,] -0.6566258 0.3611439
cplot_df <- tibble(
dim1 = sqrt(svdr$d[1]) * c(svdr$u[, 1], svdr$v[, 1]),
dim2 = sqrt(svdr$d[2]) * c(svdr$u[, 2], svdr$v[, 2]),
label = c(rownames(R), colnames(R)),
var = rep(c("hair", "eye"), each = 4)) %>%
print(n = Inf)
## # A tibble: 8 x 4
## dim1 dim2 label var
## <dbl> <dbl> <chr> <chr>
## 1 -1.57 1.17 BLACK hair
## 2 -0.752 -0.290 BROWN hair
## 3 -0.327 -1.41 RED hair
## 4 2.82 0.412 BLOND hair
## 5 0.388 -1.42 green eye
## 6 -0.615 -0.848 hazel eye
## 7 2.41 0.639 blue eye
## 8 -2.19 0.688 brown eye
library(ggrepel)
# percent of variance explained
pve1 <- svdr$d[1] / sum(svdr$d)
pve2 <- svdr$d[2] / sum(svdr$d)
cplot_df %>%
ggplot(mapping = aes(x = dim1, y = dim2, color = var, shape = var)) +
geom_point() +
geom_label_repel(mapping = aes(label = label), show.legend = F) +
geom_vline(xintercept = 0, lty = "dashed", alpha = 0.5) +
geom_hline(yintercept = 0, lty = "dashed", alpha = 0.5) +
labs(x = str_c("Dimension 1 (", signif(pve1 * 100, 3), "%)"),
y = str_c("Dimension 2 (", signif(pve2 * 100, 3), "%)"))
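The correspondence plot is a rank-2 approximation of the Pearson residual matrix. As an added sanity check, reconstructing R from the two stored singular vectors should come close to the full residual matrix:
# rank-2 reconstruction of the Pearson residuals
svdr$u %*% diag(svdr$d[1:2]) %*% t(svdr$v)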
Levels that lie in the same direction away from the origin show a strong positive association, for example BLOND hair and blue eyes. If they are diametrically apart on either side of the origin, then there is a strong negative association, for example BLOND hair and brown eyes. Levels that plot close together, such as hazel and green eyes, behave similarly.
The femsmoke data records female smokers and non-smokers, their age group, and whether the subjects were dead or still alive after 20 years.
femsmoke <- as_tibble(femsmoke) %>%
print(n = Inf)
## # A tibble: 28 x 4
## y smoker dead age
## <dbl> <fct> <fct> <fct>
## 1 2 yes yes 18-24
## 2 1 no yes 18-24
## 3 3 yes yes 25-34
## 4 5 no yes 25-34
## 5 14 yes yes 35-44
## 6 7 no yes 35-44
## 7 27 yes yes 45-54
## 8 12 no yes 45-54
## 9 51 yes yes 55-64
## 10 40 no yes 55-64
## 11 29 yes yes 65-74
## 12 101 no yes 65-74
## 13 13 yes yes 75+
## 14 64 no yes 75+
## 15 53 yes no 18-24
## 16 61 no no 18-24
## 17 121 yes no 25-34
## 18 152 no no 25-34
## 19 95 yes no 35-44
## 20 114 no no 35-44
## 21 103 yes no 45-54
## 22 66 no no 45-54
## 23 64 yes no 55-64
## 24 81 no no 55-64
## 25 7 yes no 65-74
## 26 28 no no 65-74
## 27 0 yes no 75+
## 28 0 no no 75+
There are three factors: smoker, dead, and age. If we just classify over smoker and dead,
(ct <- xtabs(y ~ smoker + dead, data = femsmoke))
## dead
## smoker yes no
## yes 139 443
## no 230 502
prop.table(ct, 1)
## dead
## smoker yes no
## yes 0.2388316 0.7611684
## no 0.3142077 0.6857923
we see that a significantly higher percentage of non-smokers died than smokers.
summary(ct)
## Call: xtabs(formula = y ~ smoker + dead, data = femsmoke)
## Number of cases in table: 1314
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 9.121, df = 1, p-value = 0.002527
This is because smoking status is confounded with the age group: smokers are more concentrated in the younger groups, and younger people are more likely to live for another 20 years.
prop.table(xtabs(y ~ smoker + age, data = femsmoke), 2)
## age
## smoker 18-24 25-34 35-44 45-54 55-64 65-74 75+
## yes 0.4700855 0.4412811 0.4739130 0.6250000 0.4872881 0.2181818 0.1688312
## no 0.5299145 0.5587189 0.5260870 0.3750000 0.5127119 0.7818182 0.8311688
ct3 <- xtabs(y ~ smoker + dead + age, data = femsmoke)
apply(ct3, 3, function (x) (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1]))
## 18-24 25-34 35-44 45-54 55-64 65-74 75+
## 2.301887 0.753719 2.400000 1.441748 1.613672 1.148515 NaN
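The 75+ stratum yields NaN because no one in that age group was still alive (both dead = no counts are 0). A common remedy, added here as a sketch, is the Haldane-Anscombe correction of adding 0.5 to every cell:
# age-specific odds ratios after adding 0.5 to each cell to avoid 0/0
apply(ct3, 3, function(x)
  ((x[1, 1] + 0.5) * (x[2, 2] + 0.5)) / ((x[1, 2] + 0.5) * (x[2, 1] + 0.5)))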
mantelhaen.test(ct3, exact = TRUE)
##
## Exact conditional test of independence in 2 x 2 x k tables
##
## data: ct3
## S = 139, p-value = 0.01591
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.068889 2.203415
## sample estimates:
## common odds ratio
## 1.530256
Under mutual independence, \[ p_{ijk} = p_i p_j p_k \] so \[ \mathbb{E} Y_{ijk} = n p_{ijk} \] and \[ \log \mathbb{E} Y_{ijk} = \log n + \log p_i + \log p_j + \log p_k. \] So the main-effects-only model corresponds to the mutual independence model.
We can test independence by the Pearson \(X^2\) test
summary(ct3)
## Call: xtabs(formula = y ~ smoker + dead + age, data = femsmoke)
## Number of cases in table: 1314
## Number of factors: 3
## Test for independence of all factors:
## Chisq = 790.6, df = 19, p-value = 2.14e-155
or by a Poisson GLM
modi <- glm(y ~ smoker + dead + age, family = poisson, data = femsmoke)
summary(modi)
##
## Call:
## glm(formula = y ~ smoker + dead + age, family = poisson, data = femsmoke)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7.9306 -5.3175 -0.5514 2.4229 11.1895
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.67778 0.10702 25.021 < 2e-16 ***
## smokerno 0.22931 0.05554 4.129 3.64e-05 ***
## deadno 0.94039 0.06139 15.319 < 2e-16 ***
## age25-34 0.87618 0.11003 7.963 1.67e-15 ***
## age35-44 0.67591 0.11356 5.952 2.65e-09 ***
## age45-54 0.57536 0.11556 4.979 6.40e-07 ***
## age55-64 0.70166 0.11307 6.206 5.45e-10 ***
## age65-74 0.34377 0.12086 2.844 0.00445 **
## age75+ -0.41837 0.14674 -2.851 0.00436 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1193.9 on 27 degrees of freedom
## Residual deviance: 735.0 on 19 degrees of freedom
## AIC: 887.2
##
## Number of Fisher Scoring iterations: 6
Both suggest a lack of fit.
modi %>% tbl_regression(exponentiate = TRUE, intercept = TRUE)
Characteristic | IRR | 95% CI | p-value
---|---|---|---
(Intercept) | 14.6 | 11.7, 17.9 | <0.001
smoker | | |
yes | — | — |
no | 1.26 | 1.13, 1.40 | <0.001
dead | | |
yes | — | — |
no | 2.56 | 2.27, 2.89 | <0.001
age | | |
18-24 | — | — |
25-34 | 2.40 | 1.94, 2.99 | <0.001
35-44 | 1.97 | 1.58, 2.46 | <0.001
45-54 | 1.78 | 1.42, 2.24 | <0.001
55-64 | 2.02 | 1.62, 2.53 | <0.001
65-74 | 1.41 | 1.11, 1.79 | 0.004
75+ | 0.66 | 0.49, 0.88 | 0.004

IRR = Incidence Rate Ratio, CI = Confidence Interval
Note that the main-effects model reproduces the marginal proportions; for example, normalizing the smoker IRRs (1 for yes, 1.26 for no) recovers the marginal proportions of smokers and non-smokers:
c(1, 1.26) / (1 + 1.26)
## [1] 0.4424779 0.5575221
prop.table(xtabs(y ~ smoker, femsmoke))
## smoker
## yes no
## 0.4429224 0.5570776
If we assume smoker and dead are dependent on each other but jointly independent of age, then \[ p_{ijk} = p_{ij} p_k \] and \[ \log \mathbb{E} Y_{ijk} = \log n + \log p_{ij} + \log p_k. \] It leads to a log-linear model with just one interaction
glm(y ~ smoker * dead + age, family = poisson, data = femsmoke) %>%
summary()
##
## Call:
## glm(formula = y ~ smoker * dead + age, family = poisson, data = femsmoke)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.2606 -5.1564 -0.5933 2.5373 10.4236
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.51582 0.12239 20.555 < 2e-16 ***
## smokerno 0.50361 0.10743 4.688 2.76e-06 ***
## deadno 1.15910 0.09722 11.922 < 2e-16 ***
## age25-34 0.87618 0.11003 7.963 1.67e-15 ***
## age35-44 0.67591 0.11356 5.952 2.65e-09 ***
## age45-54 0.57536 0.11556 4.979 6.40e-07 ***
## age55-64 0.70166 0.11307 6.206 5.45e-10 ***
## age65-74 0.34377 0.12086 2.844 0.00445 **
## age75+ -0.41837 0.14674 -2.851 0.00436 **
## smokerno:deadno -0.37858 0.12566 -3.013 0.00259 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1193.9 on 27 degrees of freedom
## Residual deviance: 725.8 on 18 degrees of freedom
## AIC: 880
##
## Number of Fisher Scoring iterations: 6
It improves the fit, but still not enough.
Let \(p_{ij \mid k}\) be the probability that an observation falls in the \((i, j)\)-cell given that the third variable takes value \(k\). If we assume the first two variables are independent given the value of the third, then \[ p_{ijk} = p_{ij\mid k} p_k = p_{i\mid k} p_{j \mid k} p_k = \frac{p_{ik} p_{jk}}{p_k} \] and \[ \log \mathbb{E} Y_{ijk} = \log n + \log p_{ik} + \log p_{jk} - \log p_k. \]
It leads to a model with two interaction terms
glm(y ~ smoker * age + dead * age, family = poisson, data = femsmoke) %>%
summary()
##
## Call:
## glm(formula = y ~ smoker * age + dead * age, family = poisson,
## data = femsmoke)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.30657 -0.26480 -0.00003 0.26643 1.20822
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.34377 0.58563 0.587 0.557199
## smokerno 0.11980 0.18523 0.647 0.517785
## age25-34 0.91760 0.68737 1.335 0.181895
## age35-44 1.95402 0.62882 3.107 0.001887 **
## age45-54 2.84979 0.60950 4.676 2.93e-06 ***
## age55-64 3.44819 0.59868 5.760 8.43e-09 ***
## age65-74 3.00134 0.61023 4.918 8.73e-07 ***
## age75+ 2.22118 0.64799 3.428 0.000609 ***
## deadno 3.63759 0.58490 6.219 5.00e-10 ***
## smokerno:age25-34 0.11616 0.22078 0.526 0.598789
## smokerno:age35-44 -0.01536 0.22749 -0.068 0.946172
## smokerno:age45-54 -0.63063 0.23414 -2.693 0.007074 **
## smokerno:age55-64 -0.06894 0.22643 -0.304 0.760765
## smokerno:age65-74 1.15649 0.26427 4.376 1.21e-05 ***
## smokerno:age75+ 1.47413 0.35617 4.139 3.49e-05 ***
## age25-34:deadno -0.10756 0.68613 -0.157 0.875435
## age35-44:deadno -1.33977 0.62810 -2.133 0.032920 *
## age45-54:deadno -2.17125 0.61128 -3.552 0.000382 ***
## age55-64:deadno -3.17171 0.59999 -5.286 1.25e-07 ***
## age65-74:deadno -4.94977 0.61512 -8.047 8.49e-16 ***
## age75+:deadno -26.30450 5776.51889 -0.005 0.996367
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1193.9378 on 27 degrees of freedom
## Residual deviance: 8.3269 on 7 degrees of freedom
## AIC: 184.52
##
## Number of Fisher Scoring iterations: 17
This model fits the data pretty well.
glm(y ~ smoker * age + dead * age, family = poisson, data = femsmoke) %>%
drop1(test = "Chi")
## Single term deletions
##
## Model:
## y ~ smoker * age + dead * age
## Df Deviance AIC LRT Pr(>Chi)
## <none> 8.33 184.52
## smoker:age 6 101.83 266.03 93.51 < 2.2e-16 ***
## age:dead 6 641.50 805.69 633.17 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Next we fit the model with all two-way interactions:
modu <- glm(y ~ (smoker + age + dead)^2, family = poisson, data = femsmoke)
summary(modu)
##
## Call:
## glm(formula = y ~ (smoker + age + dead)^2, family = poisson,
## data = femsmoke)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.70006 -0.11004 -0.00002 0.12254 0.67272
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.54284 0.58736 0.924 0.355384
## smokerno -0.29666 0.25324 -1.171 0.241401
## age25-34 0.92902 0.68381 1.359 0.174273
## age35-44 1.94048 0.62486 3.105 0.001900 **
## age45-54 2.76845 0.60657 4.564 5.02e-06 ***
## age55-64 3.37507 0.59550 5.668 1.45e-08 ***
## age65-74 2.86586 0.60894 4.706 2.52e-06 ***
## age75+ 2.02211 0.64955 3.113 0.001851 **
## deadno 3.43271 0.59014 5.817 6.00e-09 ***
## smokerno:age25-34 0.11752 0.22091 0.532 0.594749
## smokerno:age35-44 0.01268 0.22800 0.056 0.955654
## smokerno:age45-54 -0.56538 0.23585 -2.397 0.016522 *
## smokerno:age55-64 0.08512 0.23573 0.361 0.718030
## smokerno:age65-74 1.49088 0.30039 4.963 6.93e-07 ***
## smokerno:age75+ 1.89060 0.39582 4.776 1.78e-06 ***
## smokerno:deadno 0.42741 0.17703 2.414 0.015762 *
## age25-34:deadno -0.12006 0.68655 -0.175 0.861178
## age35-44:deadno -1.34112 0.62857 -2.134 0.032874 *
## age45-54:deadno -2.11336 0.61210 -3.453 0.000555 ***
## age55-64:deadno -3.18077 0.60057 -5.296 1.18e-07 ***
## age65-74:deadno -5.08798 0.61951 -8.213 < 2e-16 ***
## age75+:deadno -27.31727 8839.01146 -0.003 0.997534
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1193.9378 on 27 degrees of freedom
## Residual deviance: 2.3809 on 6 degrees of freedom
## AIC: 180.58
##
## Number of Fisher Scoring iterations: 18
xtabs(fitted(modu) ~ smoker + dead + age, data = femsmoke) %>%
apply(3, function(x) (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1]))
## 18-24 25-34 35-44 45-54 55-64 65-74 75+
## 1.533275 1.533275 1.533275 1.533275 1.533275 1.533275 1.533275
The fitted odds ratio between smoker and dead is the same in every age group, hence the name uniform association model. It equals the exponentiated interaction coefficient:
exp(coef(modu)['smokerno:deadno'])
## smokerno:deadno
## 1.533275
We can also search for a parsimonious model by backward elimination from the saturated model:
glm(y ~ smoker * age * dead, family = poisson, data = femsmoke) %>%
step(test = "Chi") %>%
drop1(test = "Chi")
## Start: AIC=190.19
## y ~ smoker * age * dead
##
## Df Deviance AIC LRT Pr(>Chi)
## - smoker:age:dead 6 2.3809 180.58 2.3809 0.8815
## <none> 0.0000 190.19
##
## Step: AIC=180.58
## y ~ smoker + age + dead + smoker:age + smoker:dead + age:dead
##
## Df Deviance AIC LRT Pr(>Chi)
## <none> 2.38 180.58
## - smoker:dead 1 8.33 184.52 5.95 0.01475 *
## - smoker:age 6 92.63 258.83 90.25 < 2e-16 ***
## - age:dead 6 632.30 798.49 629.92 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Single term deletions
##
## Model:
## y ~ smoker + age + dead + smoker:age + smoker:dead + age:dead
## Df Deviance AIC LRT Pr(>Chi)
## <none> 2.38 180.58
## smoker:age 6 92.63 258.83 90.25 < 2e-16 ***
## smoker:dead 1 8.33 184.52 5.95 0.01475 *
## age:dead 6 632.30 798.49 629.92 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction term smoker:dead is weakly significant in the final model. This corresponds to a test of conditional independence between smoker and dead given the age group. The p-value is very similar to that from the CMH test.
We can also fit a binomial model with survival status as the response and smoker and age as predictors.
# equivalent formulation using case weights on the ungrouped counts:
# glm(dead ~ smoker * age, family = binomial, weights = y, data = femsmoke) %>%
#   summary()
glm(matrix(femsmoke$y, ncol = 2) ~ smoker * age, family = binomial, femsmoke[1:14, ]) %>%
step(test = "Chi") %>%
summary()
## Start: AIC=75
## matrix(femsmoke$y, ncol = 2) ~ smoker * age
##
## Df Deviance AIC LRT Pr(>Chi)
## - smoker:age 6 2.3809 65.377 2.3809 0.8815
## <none> 0.0000 74.996
##
## Step: AIC=65.38
## matrix(femsmoke$y, ncol = 2) ~ smoker + age
##
## Df Deviance AIC LRT Pr(>Chi)
## <none> 2.38 65.38
## - smoker 1 8.33 69.32 5.95 0.01475 *
## - age 6 632.30 683.29 629.92 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## glm(formula = matrix(femsmoke$y, ncol = 2) ~ smoker + age, family = binomial,
## data = femsmoke[1:14, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.72545 -0.22836 0.00005 0.19146 0.68162
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.4327 0.5901 -5.817 6.00e-09 ***
## smokerno -0.4274 0.1770 -2.414 0.015762 *
## age25-34 0.1201 0.6865 0.175 0.861178
## age35-44 1.3411 0.6286 2.134 0.032874 *
## age45-54 2.1134 0.6121 3.453 0.000555 ***
## age55-64 3.1808 0.6006 5.296 1.18e-07 ***
## age65-74 5.0880 0.6195 8.213 < 2e-16 ***
## age75+ 27.8073 11293.1430 0.002 0.998035
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 641.4963 on 13 degrees of freedom
## Residual deviance: 2.3809 on 6 degrees of freedom
## AIC: 65.377
##
## Number of Fisher Scoring iterations: 20
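Note the residual deviance (2.3809 on 6 degrees of freedom) matches that of the Poisson uniform association model modu, reflecting the binomial-Poisson equivalence. A quick added check, assuming modu from above is still in scope:
modb <- glm(matrix(femsmoke$y, ncol = 2) ~ smoker + age,
            family = binomial, data = femsmoke[1:14, ])
deviance(modb)  # 2.3809, identical to the Poisson fit
deviance(modu)  # 2.3809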