Display system information and load tidyverse and faraway packages

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.2  magrittr_1.5    tools_3.6.2     htmltools_0.4.0
##  [5] yaml_2.2.1      Rcpp_1.0.4.6    stringi_1.4.6   rmarkdown_2.1  
##  [9] knitr_1.28      stringr_1.4.0   xfun_0.13       digest_0.6.25  
## [13] rlang_0.4.5     evaluate_0.14
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(faraway)

faraway package contains the datasets in the ELMR book.

Two-by-two tables

Image source: https://www.gep.com/mind/blog/outlook-for-the-global-semiconductor-silicon-wafer-industry

(wafer <- tibble(
  y        = c(320, 14, 80, 36),
  particle = gl(2, 1, length = 4, labels = c("no", "yes")),
  quality  = gl(2, 2, labels = c("good", "bad"))))
## # A tibble: 4 x 3
##       y particle quality
##   <dbl> <fct>    <fct>  
## 1   320 no       good   
## 2    14 yes      good   
## 3    80 no       bad    
## 4    36 yes      bad

This kind of data is conveniently presented as a \(2 \times 2\) contingency table.

(ytable <- xtabs(y ~ particle + quality, data = wafer))
##         quality
## particle good bad
##      no   320  80
##      yes   14  36
summary(ytable)
## Call: xtabs(formula = y ~ particle + quality, data = wafer)
## Number of cases in table: 450 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 62.81, df = 1, p-value = 2.274e-15

Poisson model

  • We observe the manufacturing process for a certain period of time and observe 450 wafers. The data are then cross-classified. We model the observed counts by Poisson.
library(gtsummary)
glm(y ~ particle + quality, family = poisson, data = wafer) %>%
  #tbl_regression()
  summary()
## 
## Call:
## glm(formula = y ~ particle + quality, family = poisson, data = wafer)
## 
## Deviance Residuals: 
##      1       2       3       4  
##  1.324  -4.350  -2.370   5.266  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   5.6934     0.0572  99.535   <2e-16 ***
## particleyes  -2.0794     0.1500 -13.863   <2e-16 ***
## qualitybad   -1.0575     0.1078  -9.813   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 474.10  on 3  degrees of freedom
## Residual deviance:  54.03  on 1  degrees of freedom
## AIC: 83.774
## 
## Number of Fisher Scoring iterations: 5
  • The null (intercept-only model) model assumes that the rate is constant across particle and quality levels. The analysis of deviance compares \(474.10 - 54.03\) to 2 degrees of freedom, indicating null model is rejected in favor of our two predictor Poisson model.

  • If we add an interaction term particle * quality, the model costs 4 parameters, the same number as the observations. It is equivalent to the full/saturated model. Analysis of deviance compares 54.03 on 1 degree of freedom, indicating the interaction term is highly significant.

  • Therefore our conclusion is that presence of particle in the die significantly affects the quality of wafer.

Multinomial model

  • If we assume the total sample size is fixed at \(n=450\). Then we can model the counts by a multinomial model. We consdier two models

    • \(H_1\): one full/saturated model where each cell of the \(2 \times 2\) has its own probability \(p_{ij}\), and

    • \(H_0\): one reduced model that assumes particle is independent of quality, thus \(p_{ij} = p_i p_j\).

  • For the full/saturated model \(H_1\), the multinomial likelihood is \[ \frac{n!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j p_{ij}^{y_{ij}}. \] We estimate \(p_{ij}\) by the cell proportion (MLE) \[ \widehat{p}_{ij} = \frac{y_{ij}}{n}. \]

  • For the reduced model assuming independence \(H_0\), the multinomial likelihood is \[ \frac{n!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j (p_i p_j)^{y_{ij}} = \frac{n!}{\prod_i \prod_j y_{ij}!} p_i^{\sum_j y_{ij}} p_j^{\sum_i y_{ij}}. \] The MLE is \[ \widehat{p}_i = \frac{\sum_j y_{ij}}{n}, \quad \widehat{p}_j = \frac{\sum_i y_{ij}}{n}. \]

  • The analysis deviance compares the deviance \[\begin{eqnarray*} D &=& 2 \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{n} - 2 \sum_i \left(\sum_j y_{ij}\right) \log \widehat{p}_i - 2 \sum_j \left(\sum_i y_{ij}\right) \log \widehat{p}_j \\ &=& 2 \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{n \widehat{p}_i \widehat{p}_j} \\ &=& 2 \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{\widehat{\mu}_{ij}} \end{eqnarray*}\] to 1 degree of freedom.

(partp <- xtabs(y ~ particle, data = wafer) %>% prop.table())
## particle
##        no       yes 
## 0.8888889 0.1111111
(qualp <- xtabs(y ~ quality, data = wafer)  %>% prop.table())
## quality
##      good       bad 
## 0.7422222 0.2577778
muhat <- 450 * outer(partp, qualp)
ytable <- xtabs(y ~ particle + quality, data = wafer)
2 * sum(ytable * log(ytable / muhat))
## [1] 54.03045
  • We get the exact same result as the analysis of deviance in the Poisson model.

  • This connection between Poisson and multinomial is no surprise due to the following fact. If \(Y_1, \ldots, Y_k\) are independent Poisson with means \(\lambda_1, \ldots, \lambda_k\), then the joint distribution of \(Y_1, \ldots, Y_k \mid \sum_i Y_i = n\) is multinomial with probabilities \(p_j = \lambda_j / \sum_i \lambda_i\).

Binomial

  • If we view particle as a predictor affecting whether a wafer is good quality or bad quality, we end up with an independent binomial model.

  • The null (intercept-only) model corresponds to the hypothesis particle does not affect quality.

tibble(good     = c(320, 14),
       bad      = c(80, 36),
       particle = c("no", "yes")) %>%
  print() %>%
  glm(cbind(good, bad) ~ 1, family = binomial, data = .) %>%
  summary()
## # A tibble: 2 x 3
##    good   bad particle
##   <dbl> <dbl> <chr>   
## 1   320    80 no      
## 2    14    36 yes
## 
## Call:
## glm(formula = cbind(good, bad) ~ 1, family = binomial, data = .)
## 
## Deviance Residuals: 
##      1       2  
##  2.715  -6.831  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.0576     0.1078   9.813   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 54.03  on 1  degrees of freedom
## Residual deviance: 54.03  on 1  degrees of freedom
## AIC: 66.191
## 
## Number of Fisher Scoring iterations: 3

The alternative model corresponds to the full/saturated model where particle is included as a predictor.

  • Again we observe the exactly same analysis of deviance inference.

  • When there are more than two rows or columns, this model is called the product binoimal/multinomial model.

Hypergeometric

  • Finally if we fix both row and column marginal totals, the probability of the observed table is \[\begin{eqnarray*} p(y_{11}, y_{12}, y_{21}, y_{22}) &=& \frac{\binom{y_{1\cdot}}{y_{11}} \binom{y_{2\cdot}}{y_{22}} \binom{y_{\cdot 1}}{y_{21}} \binom{y_{\cdot 2}}{y_{12}}}{\binom{n}{y_{11} \, y_{12} \, y_{21} \, y_{22}}} \\ &=& \frac{y_{1\cdot}! y_{2\cdot}! y_{\cdot 1}! y_{\cdot 2}!}{y_{11}! y_{12}! y_{21}! y_{22}! n!} \\ &=& \frac{\binom{y_{1 \cdot}}{y_{11}} \binom{y_{2 \cdot}}{y_{\cdot 1} - y_{11}}}{\binom{n}{y_{\cdot 1}}}. \end{eqnarray*}\]

  • Under the null hypothesis that particle is not associated with quality, the Fisher’s exact test calculates the p-value by summing over the probabilities of tables with more extreme observations \[ \sum_{y_{11} \ge 320} p(y_{11}, y_{12}, y_{21}, y_{22}). \]

fisher.test(ytable)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  ytable
## p-value = 2.955e-13
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   5.090628 21.544071
## sample estimates:
## odds ratio 
##   10.21331

The odds ratio is \[ \frac{\pi_{\text{no particle}} / (1 - \pi_{\text{no particle}})}{\pi_{\text{particle}} / (1 - \pi_{\text{particle}})} = \frac{y_{11} y_{22}}{y_{12} y_{21}} \]

(320 * 36) / (14 * 80)
## [1] 10.28571

Larger two-way tables

haireye <- as_tibble(haireye) %>%
  print(n = Inf)
## # A tibble: 16 x 3
##        y eye   hair 
##    <dbl> <fct> <fct>
##  1     5 green BLACK
##  2    29 green BROWN
##  3    14 green RED  
##  4    16 green BLOND
##  5    15 hazel BLACK
##  6    54 hazel BROWN
##  7    14 hazel RED  
##  8    10 hazel BLOND
##  9    20 blue  BLACK
## 10    84 blue  BROWN
## 11    17 blue  RED  
## 12    94 blue  BLOND
## 13    68 brown BLACK
## 14   119 brown BROWN
## 15    26 brown RED  
## 16     7 brown BLOND
(haireye_table <- xtabs(y ~ hair + eye, data = haireye))
##        eye
## hair    green hazel blue brown
##   BLACK     5    15   20    68
##   BROWN    29    54   84   119
##   RED      14    14   17    26
##   BLOND    16    10   94     7
mosaicplot(haireye_table, color = TRUE, main = NULL, las = 1)

summary(haireye_table)
## Call: xtabs(formula = y ~ hair + eye, data = haireye)
## Number of cases in table: 592 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 138.29, df = 9, p-value = 2.325e-25

Three-way contingency tables

femsmoke <- as_tibble(femsmoke) %>%
  print(n = Inf)
## # A tibble: 28 x 4
##        y smoker dead  age  
##    <dbl> <fct>  <fct> <fct>
##  1     2 yes    yes   18-24
##  2     1 no     yes   18-24
##  3     3 yes    yes   25-34
##  4     5 no     yes   25-34
##  5    14 yes    yes   35-44
##  6     7 no     yes   35-44
##  7    27 yes    yes   45-54
##  8    12 no     yes   45-54
##  9    51 yes    yes   55-64
## 10    40 no     yes   55-64
## 11    29 yes    yes   65-74
## 12   101 no     yes   65-74
## 13    13 yes    yes   75+  
## 14    64 no     yes   75+  
## 15    53 yes    no    18-24
## 16    61 no     no    18-24
## 17   121 yes    no    25-34
## 18   152 no     no    25-34
## 19    95 yes    no    35-44
## 20   114 no     no    35-44
## 21   103 yes    no    45-54
## 22    66 no     no    45-54
## 23    64 yes    no    55-64
## 24    81 no     no    55-64
## 25     7 yes    no    65-74
## 26    28 no     no    65-74
## 27     0 yes    no    75+  
## 28     0 no     no    75+

There are three factors smoker, dead, and age. If we just classify over smoker and dead

(ct <- xtabs(y ~ smoker + dead, data = femsmoke))
##       dead
## smoker yes  no
##    yes 139 443
##    no  230 502
prop.table(ct, 1)
##       dead
## smoker       yes        no
##    yes 0.2388316 0.7611684
##    no  0.3142077 0.6857923

we see significantly higher percentage of non-smokers died than smokers.

summary(ct)
## Call: xtabs(formula = y ~ smoker + dead, data = femsmoke)
## Number of cases in table: 1314 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 9.121, df = 1, p-value = 0.002527
prop.table(xtabs(y ~ smoker + age, data = femsmoke), 2)
##       age
## smoker     18-24     25-34     35-44     45-54     55-64     65-74       75+
##    yes 0.4700855 0.4412811 0.4739130 0.6250000 0.4872881 0.2181818 0.1688312
##    no  0.5299145 0.5587189 0.5260870 0.3750000 0.5127119 0.7818182 0.8311688
ct3 <- xtabs(y ~ smoker + dead + age, data = femsmoke)
apply(ct3, 3, function (x) (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1]))
##    18-24    25-34    35-44    45-54    55-64    65-74      75+ 
## 2.301887 0.753719 2.400000 1.441748 1.613672 1.148515      NaN
mantelhaen.test(ct3, exact = TRUE)
## 
##  Exact conditional test of independence in 2 x 2 x k tables
## 
## data:  ct3
## S = 139, p-value = 0.01591
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.068889 2.203415
## sample estimates:
## common odds ratio 
##          1.530256

Mutual independence model

  • Under mutual independence, \[ p_{ijk} = p_i p_j p_k \] so \[ \mathbb{E} Y_{ijk} = n p_{ijk} \] and \[ \log \mathbb{E} Y_{ijk} = \log n + \log p_i + \log p_j + \log p_k. \] So the main effect-only model corresponds to the mutual independence model.

  • We can test independence by the Pearson \(X^2\) test

summary(ct3)
## Call: xtabs(formula = y ~ smoker + dead + age, data = femsmoke)
## Number of cases in table: 1314 
## Number of factors: 3 
## Test for independence of all factors:
##  Chisq = 790.6, df = 19, p-value = 2.14e-155

or by a Poisson GLM

modi <- glm(y ~ smoker + dead + age, family = poisson, data = femsmoke)
summary(modi)
## 
## Call:
## glm(formula = y ~ smoker + dead + age, family = poisson, data = femsmoke)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -7.9306  -5.3175  -0.5514   2.4229  11.1895  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.67778    0.10702  25.021  < 2e-16 ***
## smokerno     0.22931    0.05554   4.129 3.64e-05 ***
## deadno       0.94039    0.06139  15.319  < 2e-16 ***
## age25-34     0.87618    0.11003   7.963 1.67e-15 ***
## age35-44     0.67591    0.11356   5.952 2.65e-09 ***
## age45-54     0.57536    0.11556   4.979 6.40e-07 ***
## age55-64     0.70166    0.11307   6.206 5.45e-10 ***
## age65-74     0.34377    0.12086   2.844  0.00445 ** 
## age75+      -0.41837    0.14674  -2.851  0.00436 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1193.9  on 27  degrees of freedom
## Residual deviance:  735.0  on 19  degrees of freedom
## AIC: 887.2
## 
## Number of Fisher Scoring iterations: 6

Both suggest a lack of fit.

modi %>% tbl_regression(exponentiate = TRUE, intercept = TRUE)

Characteristic IRR1 95% CI1 p-value
(Intercept) 14.6 11.7, 17.9 <0.001
smoker
yes
no 1.26 1.13, 1.40 <0.001
dead
yes
no 2.56 2.27, 2.89 <0.001
age
18-24
25-34 2.40 1.94, 2.99 <0.001
35-44 1.97 1.58, 2.46 <0.001
45-54 1.78 1.42, 2.24 <0.001
55-64 2.02 1.62, 2.53 <0.001
65-74 1.41 1.11, 1.79 0.004
75+ 0.66 0.49, 0.88 0.004

1 IRR = Incidence Rate Ratio, CI = Confidence Interval

The exponentiated coefficients are just marginal proportions

c(1, 1.26) / (1 + 1.26)
## [1] 0.4424779 0.5575221
prop.table(xtabs(y ~ smoker, femsmoke))
## smoker
##       yes        no 
## 0.4429224 0.5570776

Joint independence

  • If we assume the first two variables are dependent, and jointly independent of the third. Then \[ p_{ijk} = p_{ij} p_k \] and \[ \log \mathbb{E} Y_{ijk} = \log n + \log p_{ij} + \log p_k. \]
  • It leads to a log-linear model with just one interaction

    glm(y ~ smoker * dead + age, family = poisson, data = femsmoke) %>%
      summary()
    ## 
    ## Call:
    ## glm(formula = y ~ smoker * dead + age, family = poisson, data = femsmoke)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -8.2606  -5.1564  -0.5933   2.5373  10.4236  
    ## 
    ## Coefficients:
    ##                 Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)      2.51582    0.12239  20.555  < 2e-16 ***
    ## smokerno         0.50361    0.10743   4.688 2.76e-06 ***
    ## deadno           1.15910    0.09722  11.922  < 2e-16 ***
    ## age25-34         0.87618    0.11003   7.963 1.67e-15 ***
    ## age35-44         0.67591    0.11356   5.952 2.65e-09 ***
    ## age45-54         0.57536    0.11556   4.979 6.40e-07 ***
    ## age55-64         0.70166    0.11307   6.206 5.45e-10 ***
    ## age65-74         0.34377    0.12086   2.844  0.00445 ** 
    ## age75+          -0.41837    0.14674  -2.851  0.00436 ** 
    ## smokerno:deadno -0.37858    0.12566  -3.013  0.00259 ** 
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for poisson family taken to be 1)
    ## 
    ##     Null deviance: 1193.9  on 27  degrees of freedom
    ## Residual deviance:  725.8  on 18  degrees of freedom
    ## AIC: 880
    ## 
    ## Number of Fisher Scoring iterations: 6

    It improves the fit, but still not enough.

Conditional independence

  • Let \(P_{ij \mid k}\) be the probability that an observation falls in \((i, j)\)-cell given that we know the third variables takes value \(k\). If we assume that first two variables are independent give value of the third. Then \[ p_{ijk} = p_{ij\mid k} p_k = p_{i\mid k} p_{j \mid k} p_k = \frac{p_{ik} p_{jk}}{p_k}. \] and \[ \log \mathbb{E} Y_{ijk} = \log n + \log p_{ik} + \log p_{jk} - \log p_k. \]

  • It leads to a model with two interaction terms

glm(y ~ smoker * age + dead * age, family = poisson, data = femsmoke) %>%
  summary()
## 
## Call:
## glm(formula = y ~ smoker * age + dead * age, family = poisson, 
##     data = femsmoke)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.30657  -0.26480  -0.00003   0.26643   1.20822  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          0.34377    0.58563   0.587 0.557199    
## smokerno             0.11980    0.18523   0.647 0.517785    
## age25-34             0.91760    0.68737   1.335 0.181895    
## age35-44             1.95402    0.62882   3.107 0.001887 ** 
## age45-54             2.84979    0.60950   4.676 2.93e-06 ***
## age55-64             3.44819    0.59868   5.760 8.43e-09 ***
## age65-74             3.00134    0.61023   4.918 8.73e-07 ***
## age75+               2.22118    0.64799   3.428 0.000609 ***
## deadno               3.63759    0.58490   6.219 5.00e-10 ***
## smokerno:age25-34    0.11616    0.22078   0.526 0.598789    
## smokerno:age35-44   -0.01536    0.22749  -0.068 0.946172    
## smokerno:age45-54   -0.63063    0.23414  -2.693 0.007074 ** 
## smokerno:age55-64   -0.06894    0.22643  -0.304 0.760765    
## smokerno:age65-74    1.15649    0.26427   4.376 1.21e-05 ***
## smokerno:age75+      1.47413    0.35617   4.139 3.49e-05 ***
## age25-34:deadno     -0.10756    0.68613  -0.157 0.875435    
## age35-44:deadno     -1.33977    0.62810  -2.133 0.032920 *  
## age45-54:deadno     -2.17125    0.61128  -3.552 0.000382 ***
## age55-64:deadno     -3.17171    0.59999  -5.286 1.25e-07 ***
## age65-74:deadno     -4.94977    0.61512  -8.047 8.49e-16 ***
## age75+:deadno      -26.30450 5776.51889  -0.005 0.996367    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1193.9378  on 27  degrees of freedom
## Residual deviance:    8.3269  on  7  degrees of freedom
## AIC: 184.52
## 
## Number of Fisher Scoring iterations: 17

This model fits data pretty well.

  • Significance of predictors
glm(y ~ smoker * age + dead * age, family = poisson, data = femsmoke) %>%
  drop1(test = "Chi")
## Single term deletions
## 
## Model:
## y ~ smoker * age + dead * age
##            Df Deviance    AIC    LRT  Pr(>Chi)    
## <none>            8.33 184.52                     
## smoker:age  6   101.83 266.03  93.51 < 2.2e-16 ***
## age:dead    6   641.50 805.69 633.17 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Uniform association

  • If we conisder a model with all two-way interactions \[ \log \mathbb{E}Y_{ijk} = \log n + \log p_i + \log p_j + \log p_k + \log p_{ij} + \log p_{ik} + \log p_{jk}. \]
  • There is no simple interpretation in terms of independence.
modu <- glm(y ~ (smoker + age + dead)^2, family = poisson, data = femsmoke)
summary(modu)
## 
## Call:
## glm(formula = y ~ (smoker + age + dead)^2, family = poisson, 
##     data = femsmoke)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.70006  -0.11004  -0.00002   0.12254   0.67272  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          0.54284    0.58736   0.924 0.355384    
## smokerno            -0.29666    0.25324  -1.171 0.241401    
## age25-34             0.92902    0.68381   1.359 0.174273    
## age35-44             1.94048    0.62486   3.105 0.001900 ** 
## age45-54             2.76845    0.60657   4.564 5.02e-06 ***
## age55-64             3.37507    0.59550   5.668 1.45e-08 ***
## age65-74             2.86586    0.60894   4.706 2.52e-06 ***
## age75+               2.02211    0.64955   3.113 0.001851 ** 
## deadno               3.43271    0.59014   5.817 6.00e-09 ***
## smokerno:age25-34    0.11752    0.22091   0.532 0.594749    
## smokerno:age35-44    0.01268    0.22800   0.056 0.955654    
## smokerno:age45-54   -0.56538    0.23585  -2.397 0.016522 *  
## smokerno:age55-64    0.08512    0.23573   0.361 0.718030    
## smokerno:age65-74    1.49088    0.30039   4.963 6.93e-07 ***
## smokerno:age75+      1.89060    0.39582   4.776 1.78e-06 ***
## smokerno:deadno      0.42741    0.17703   2.414 0.015762 *  
## age25-34:deadno     -0.12006    0.68655  -0.175 0.861178    
## age35-44:deadno     -1.34112    0.62857  -2.134 0.032874 *  
## age45-54:deadno     -2.11336    0.61210  -3.453 0.000555 ***
## age55-64:deadno     -3.18077    0.60057  -5.296 1.18e-07 ***
## age65-74:deadno     -5.08798    0.61951  -8.213  < 2e-16 ***
## age75+:deadno      -27.31727 8839.01146  -0.003 0.997534    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1193.9378  on 27  degrees of freedom
## Residual deviance:    2.3809  on  6  degrees of freedom
## AIC: 180.58
## 
## Number of Fisher Scoring iterations: 18
  • If we compute the odds ratio for each age group
xtabs(fitted(modu) ~ smoker + dead + age, data = femsmoke) %>%
  apply(3, function(x) (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1]))
##    18-24    25-34    35-44    45-54    55-64    65-74      75+ 
## 1.533275 1.533275 1.533275 1.533275 1.533275 1.533275 1.533275

Thus the name uniform association model.

  • The odds ratio can also be extracted from the coefficients
 exp(coef(modu)['smokerno:deadno'])
## smokerno:deadno 
##        1.533275

Model selection

  • We can start from the saturated model (3-way interaction) and use the analysis of deviance to see how much we can reduce the model.
glm(y ~ smoker * age * dead, family = poisson, data = femsmoke) %>%
  step(test = "Chi") %>%
  drop1(test = "Chi")
## Start:  AIC=190.19
## y ~ smoker * age * dead
## 
##                   Df Deviance    AIC    LRT Pr(>Chi)
## - smoker:age:dead  6   2.3809 180.58 2.3809   0.8815
## <none>                 0.0000 190.19                
## 
## Step:  AIC=180.58
## y ~ smoker + age + dead + smoker:age + smoker:dead + age:dead
## 
##               Df Deviance    AIC    LRT Pr(>Chi)    
## <none>               2.38 180.58                    
## - smoker:dead  1     8.33 184.52   5.95  0.01475 *  
## - smoker:age   6    92.63 258.83  90.25  < 2e-16 ***
## - age:dead     6   632.30 798.49 629.92  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Single term deletions
## 
## Model:
## y ~ smoker + age + dead + smoker:age + smoker:dead + age:dead
##             Df Deviance    AIC    LRT Pr(>Chi)    
## <none>             2.38 180.58                    
## smoker:age   6    92.63 258.83  90.25  < 2e-16 ***
## smoker:dead  1     8.33 184.52   5.95  0.01475 *  
## age:dead     6   632.30 798.49 629.92  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction term smoker:dead is weakly significant in the final model. This corresponds to the test of conditional independence between smoke and dead given age group. The p-value is very similar to that from the CMH test.

Binomial

  • It also makes intuitive sense to model life status as the outcome depending on perdictors smoker and age.
# glm(dead ~ smoker * age, family = binomial, weights = y, data = femsmoke) %>%
#   summary()
glm(matrix(femsmoke$y, ncol = 2) ~ smoker * age, family = binomial, femsmoke[1:14, ]) %>%
  step(test = "Chi") %>%
  summary()
## Start:  AIC=75
## matrix(femsmoke$y, ncol = 2) ~ smoker * age
## 
##              Df Deviance    AIC    LRT Pr(>Chi)
## - smoker:age  6   2.3809 65.377 2.3809   0.8815
## <none>            0.0000 74.996                
## 
## Step:  AIC=65.38
## matrix(femsmoke$y, ncol = 2) ~ smoker + age
## 
##          Df Deviance    AIC    LRT Pr(>Chi)    
## <none>          2.38  65.38                    
## - smoker  1     8.33  69.32   5.95  0.01475 *  
## - age     6   632.30 683.29 629.92  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## glm(formula = matrix(femsmoke$y, ncol = 2) ~ smoker + age, family = binomial, 
##     data = femsmoke[1:14, ])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.72545  -0.22836   0.00005   0.19146   0.68162  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -3.4327     0.5901  -5.817 6.00e-09 ***
## smokerno       -0.4274     0.1770  -2.414 0.015762 *  
## age25-34        0.1201     0.6865   0.175 0.861178    
## age35-44        1.3411     0.6286   2.134 0.032874 *  
## age45-54        2.1134     0.6121   3.453 0.000555 ***
## age55-64        3.1808     0.6006   5.296 1.18e-07 ***
## age65-74        5.0880     0.6195   8.213  < 2e-16 ***
## age75+         27.8073 11293.1430   0.002 0.998035    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 641.4963  on 13  degrees of freedom
## Residual deviance:   2.3809  on  6  degrees of freedom
## AIC: 65.377
## 
## Number of Fisher Scoring iterations: 20
  • The final model retains only main effects. This is equivalent to the uniform association model. The deviance is exactly the same.