Display system information and load the tidyverse and faraway packages.
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.2 magrittr_1.5 tools_3.6.2 htmltools_0.4.0
## [5] yaml_2.2.1 Rcpp_1.0.4.6 stringi_1.4.6 rmarkdown_2.1
## [9] knitr_1.28 stringr_1.4.0 xfun_0.13 digest_0.6.25
## [13] rlang_0.4.5 evaluate_0.14
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.4
## ✓ tibble 3.0.0 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(faraway)
The faraway package contains the datasets in the ELMR book.
Image source: https://www.gep.com/mind/blog/outlook-for-the-global-semiconductor-silicon-wafer-industry
(wafer <- tibble(
y = c(320, 14, 80, 36),
particle = gl(2, 1, length = 4, labels = c("no", "yes")),
quality = gl(2, 2, labels = c("good", "bad"))))
## # A tibble: 4 x 3
## y particle quality
## <dbl> <fct> <fct>
## 1 320 no good
## 2 14 yes good
## 3 80 no bad
## 4 36 yes bad
This kind of data is conveniently presented as a \(2 \times 2\) contingency table.
(ytable <- xtabs(y ~ particle + quality, data = wafer))
## quality
## particle good bad
## no 320 80
## yes 14 36
summary(ytable)
## Call: xtabs(formula = y ~ particle + quality, data = wafer)
## Number of cases in table: 450
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 62.81, df = 1, p-value = 2.274e-15
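The same Pearson \(X^2\) statistic can be obtained from chisq.test directly (a quick cross-check, not in the original notes; the continuity correction is disabled to match the xtabs summary):
chisq.test(ytable, correct = FALSE)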
library(gtsummary)
glm(y ~ particle + quality, family = poisson, data = wafer) %>%
#tbl_regression()
summary()
##
## Call:
## glm(formula = y ~ particle + quality, family = poisson, data = wafer)
##
## Deviance Residuals:
## 1 2 3 4
## 1.324 -4.350 -2.370 5.266
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.6934 0.0572 99.535 <2e-16 ***
## particleyes -2.0794 0.1500 -13.863 <2e-16 ***
## qualitybad -1.0575 0.1078 -9.813 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 474.10 on 3 degrees of freedom
## Residual deviance: 54.03 on 1 degrees of freedom
## AIC: 83.774
##
## Number of Fisher Scoring iterations: 5
The null (intercept-only) model assumes that the rate is constant across particle and quality levels. The analysis of deviance compares the deviance difference \(474.10 - 54.03 = 420.07\) to a \(\chi^2_2\) distribution, indicating the null model is rejected in favor of our two-predictor Poisson model.
If we add an interaction term particle * quality, the model costs 4 parameters, the same number as the observations, so it is equivalent to the full/saturated model. The analysis of deviance compares the residual deviance 54.03 to a \(\chi^2_1\) distribution, indicating the interaction term is highly significant.
Therefore we conclude that the presence of a particle on the die significantly affects the quality of the wafer.
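As a quick check, the two p-values can be computed directly from the reported deviances (a minimal sketch):
# null vs two-predictor model: deviance difference on 2 df
pchisq(474.10 - 54.03, df = 2, lower.tail = FALSE)
# two-predictor vs saturated model: residual deviance on 1 df
pchisq(54.03, df = 1, lower.tail = FALSE)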
Now assume the total sample size is fixed at \(n=450\). Then we can model the counts by a multinomial model. We consider two models:
\(H_1\): the full/saturated model, where each cell of the \(2 \times 2\) table has its own probability \(p_{ij}\), and
\(H_0\): a reduced model that assumes particle is independent of quality, thus \(p_{ij} = p_i p_j\).
For the full/saturated model \(H_1\), the multinomial likelihood is \[ \frac{n!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j p_{ij}^{y_{ij}}. \] We estimate \(p_{ij}\) by the cell proportion (MLE) \[ \widehat{p}_{ij} = \frac{y_{ij}}{n}. \]
For the reduced model assuming independence \(H_0\), the multinomial likelihood is \[ \frac{n!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j (p_i p_j)^{y_{ij}} = \frac{n!}{\prod_i \prod_j y_{ij}!} p_i^{\sum_j y_{ij}} p_j^{\sum_i y_{ij}}. \] The MLE is \[ \widehat{p}_i = \frac{\sum_j y_{ij}}{n}, \quad \widehat{p}_j = \frac{\sum_i y_{ij}}{n}. \]
The analysis of deviance compares the deviance \[\begin{eqnarray*} D &=& 2 \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{n} - 2 \sum_i \left(\sum_j y_{ij}\right) \log \widehat{p}_i - 2 \sum_j \left(\sum_i y_{ij}\right) \log \widehat{p}_j \\ &=& 2 \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{n \widehat{p}_i \widehat{p}_j} \\ &=& 2 \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{\widehat{\mu}_{ij}} \end{eqnarray*}\] to 1 degree of freedom.
(partp <- xtabs(y ~ particle, data = wafer) %>% prop.table())
## particle
## no yes
## 0.8888889 0.1111111
(qualp <- xtabs(y ~ quality, data = wafer) %>% prop.table())
## quality
## good bad
## 0.7422222 0.2577778
muhat <- 450 * outer(partp, qualp)
ytable <- xtabs(y ~ particle + quality, data = wafer)
2 * sum(ytable * log(ytable / muhat))
## [1] 54.03045
We get the exact same result as the analysis of deviance in the Poisson model.
This connection between Poisson and multinomial is no surprise due to the following fact. If \(Y_1, \ldots, Y_k\) are independent Poisson with means \(\lambda_1, \ldots, \lambda_k\), then the joint distribution of \(Y_1, \ldots, Y_k \mid \sum_i Y_i = n\) is multinomial with probabilities \(p_j = \lambda_j / \sum_i \lambda_i\).
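A small simulation, added here as a sketch, illustrates this fact:
# condition independent Poisson draws on their total; the cell proportions
# should approximate lambda / sum(lambda) = (0.2, 0.3, 0.5)
set.seed(123)
lambda <- c(2, 3, 5)
ysim <- matrix(rpois(3e5, lambda), ncol = 3, byrow = TRUE)
keep <- rowSums(ysim) == 10   # condition on the total being 10
colMeans(ysim[keep, ]) / 10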
If we view particle as a predictor affecting whether a wafer is of good or bad quality, we end up with an independent binomial model. The null (intercept-only) model corresponds to the hypothesis that particle does not affect quality.
tibble(good = c(320, 14),
bad = c(80, 36),
particle = c("no", "yes")) %>%
print() %>%
glm(cbind(good, bad) ~ 1, family = binomial, data = .) %>%
summary()
## # A tibble: 2 x 3
## good bad particle
## <dbl> <dbl> <chr>
## 1 320 80 no
## 2 14 36 yes
##
## Call:
## glm(formula = cbind(good, bad) ~ 1, family = binomial, data = .)
##
## Deviance Residuals:
## 1 2
## 2.715 -6.831
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.0576 0.1078 9.813 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 54.03 on 1 degrees of freedom
## Residual deviance: 54.03 on 1 degrees of freedom
## AIC: 66.191
##
## Number of Fisher Scoring iterations: 3
The alternative model corresponds to the full/saturated model where particle is included as a predictor.
Again we observe exactly the same analysis of deviance inference.
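As a sketch (not in the original, using a small helper data frame wafer2), fitting particle as a predictor yields the saturated model, whose residual deviance is 0 while the null deviance remains 54.03:
# saturated binomial fit with particle as a predictor
wafer2 <- tibble(good = c(320, 14), bad = c(80, 36), particle = c("no", "yes"))
glm(cbind(good, bad) ~ particle, family = binomial, data = wafer2) %>%
  summary()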
When there are more than two rows or columns, this model is called the product binomial/multinomial model.
Finally if we fix both row and column marginal totals, the probability of the observed table is \[\begin{eqnarray*} p(y_{11}, y_{12}, y_{21}, y_{22}) &=& \frac{\binom{y_{1\cdot}}{y_{11}} \binom{y_{2\cdot}}{y_{22}} \binom{y_{\cdot 1}}{y_{21}} \binom{y_{\cdot 2}}{y_{12}}}{\binom{n}{y_{11} \, y_{12} \, y_{21} \, y_{22}}} \\ &=& \frac{y_{1\cdot}! y_{2\cdot}! y_{\cdot 1}! y_{\cdot 2}!}{y_{11}! y_{12}! y_{21}! y_{22}! n!} \\ &=& \frac{\binom{y_{1 \cdot}}{y_{11}} \binom{y_{2 \cdot}}{y_{\cdot 1} - y_{11}}}{\binom{n}{y_{\cdot 1}}}. \end{eqnarray*}\]
Under the null hypothesis that particle is not associated with quality, Fisher’s exact test calculates the p-value by summing the probabilities of tables with more extreme observations \[
\sum_{y_{11} \ge 320} p(y_{11}, y_{12}, y_{21}, y_{22}).
\]
fisher.test(ytable)
##
## Fisher's Exact Test for Count Data
##
## data: ytable
## p-value = 2.955e-13
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 5.090628 21.544071
## sample estimates:
## odds ratio
## 10.21331
The sample odds ratio is \[ \frac{\pi_{\text{no particle}} / (1 - \pi_{\text{no particle}})}{\pi_{\text{particle}} / (1 - \pi_{\text{particle}})} = \frac{y_{11} y_{22}}{y_{12} y_{21}}. \] (Note that fisher.test reports the conditional MLE of the odds ratio, 10.21, which differs slightly from the sample value.)
(320 * 36) / (14 * 80)
## [1] 10.28571
The haireye data set contains data on 592 statistics students cross-classified by hair and eye color.
haireye <- as_tibble(haireye) %>%
print(n = Inf)
## # A tibble: 16 x 3
## y eye hair
## <dbl> <fct> <fct>
## 1 5 green BLACK
## 2 29 green BROWN
## 3 14 green RED
## 4 16 green BLOND
## 5 15 hazel BLACK
## 6 54 hazel BROWN
## 7 14 hazel RED
## 8 10 hazel BLOND
## 9 20 blue BLACK
## 10 84 blue BROWN
## 11 17 blue RED
## 12 94 blue BLOND
## 13 68 brown BLACK
## 14 119 brown BROWN
## 15 26 brown RED
## 16 7 brown BLOND
(haireye_table <- xtabs(y ~ hair + eye, data = haireye))
## eye
## hair green hazel blue brown
## BLACK 5 15 20 68
## BROWN 29 54 84 119
## RED 14 14 17 26
## BLOND 16 10 94 7
If eye and hair are independent, we expect the mosaic plot to show a regular grid.
mosaicplot(haireye_table, color = TRUE, main = NULL, las = 1)
summary(haireye_table)
## Call: xtabs(formula = y ~ hair + eye, data = haireye)
## Number of cases in table: 592
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 138.29, df = 9, p-value = 2.325e-25
We fit a Poisson GLM:
modc <- glm(y ~ hair + eye, family = poisson, data = haireye)
summary(modc)
##
## Call:
## glm(formula = y ~ hair + eye, family = poisson, data = haireye)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7.326 -2.065 -0.212 1.235 6.172
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.4575 0.1523 16.136 < 2e-16 ***
## hairBROWN 0.9739 0.1129 8.623 < 2e-16 ***
## hairRED -0.4195 0.1528 -2.745 0.00604 **
## hairBLOND 0.1621 0.1309 1.238 0.21569
## eyehazel 0.3737 0.1624 2.301 0.02139 *
## eyeblue 1.2118 0.1424 8.510 < 2e-16 ***
## eyebrown 1.2347 0.1420 8.694 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 453.31 on 15 degrees of freedom
## Residual deviance: 146.44 on 9 degrees of freedom
## AIC: 241.04
##
## Number of Fisher Scoring iterations: 5
which clearly shows a lack of fit. The interaction model is equivalent to the full/saturated model. Therefore we see strong evidence of dependence between eye and hair.
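The lack-of-fit p-value can be computed from the residual deviance reported above (a quick added check):
# residual deviance 146.44 on 9 df
pchisq(146.44, df = 9, lower.tail = FALSE)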
We follow up with a correspondence analysis to study how eye and hair are dependent.
(R <- xtabs(residuals(modc, type = "pearson") ~ hair + eye, data = haireye))
## eye
## hair green hazel blue brown
## BLACK -1.95368354 -0.47735203 -3.06937747 4.39839852
## BROWN -0.34509961 1.35328398 -1.94947682 1.23345810
## RED 2.28273672 0.85225273 -1.73012546 -0.07497794
## BLOND 0.61269815 -2.22784430 7.04959022 -5.85099741
(svdr <- svd(R, nu = 2, nv = 2))
## $d
## [1] 1.111726e+01 3.627417e+00 1.240273e+00 1.678879e-10
##
## $u
## [,1] [,2]
## [1,] -0.47166009 0.6154461
## [2,] -0.22552151 -0.1522951
## [3,] -0.09817011 -0.7424993
## [4,] 0.84678181 0.2161646
##
## $v
## [,1] [,2]
## [1,] 0.1163980 -0.7477267
## [2,] -0.1844169 -0.4450167
## [3,] 0.7220003 0.3353208
## [4,] -0.6566258 0.3611439
cplot_df <- tibble(
dim1 = sqrt(svdr$d[1]) * c(svdr$u[, 1], svdr$v[, 1]),
dim2 = sqrt(svdr$d[2]) * c(svdr$u[, 2], svdr$v[, 2]),
label = c(rownames(R), colnames(R)),
var = rep(c("hair", "eye"), each = 4)) %>%
print(n = Inf)
## # A tibble: 8 x 4
## dim1 dim2 label var
## <dbl> <dbl> <chr> <chr>
## 1 -1.57 1.17 BLACK hair
## 2 -0.752 -0.290 BROWN hair
## 3 -0.327 -1.41 RED hair
## 4 2.82 0.412 BLOND hair
## 5 0.388 -1.42 green eye
## 6 -0.615 -0.848 hazel eye
## 7 2.41 0.639 blue eye
## 8 -2.19 0.688 brown eye
library(ggrepel)
# percent of variance explained
pve1 <- svdr$d[1] / sum(svdr$d)
pve2 <- svdr$d[2] / sum(svdr$d)
cplot_df %>%
ggplot(mapping = aes(x = dim1, y = dim2, color = var, shape = var)) +
geom_point() +
geom_label_repel(mapping = aes(label = label), show.legend = F) +
geom_vline(xintercept = 0, lty = "dashed", alpha = 0.5) +
geom_hline(yintercept = 0, lty = "dashed", alpha = 0.5) +
labs(x = str_c("Dimension 1 (", signif(pve1 * 100, 3), "%)"),
y = str_c("Dimension 2 (", signif(pve2 * 100, 3), "%)"))
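The correspondence plot is a rank-2 approximation of the Pearson residual matrix. As an added sanity check, reconstructing R from the two stored singular vectors should come close to the full residual matrix:
# rank-2 reconstruction of the Pearson residuals
svdr$u %*% diag(svdr$d[1:2]) %*% t(svdr$v)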
Levels that lie in the same direction away from the origin show a strong positive association, for example BLOND hair and blue eyes. If they are diametrically apart on either side of the origin, then there is a strong negative association, for example BLOND hair and brown eyes. Levels that plot close together, such as hazel and green eyes, behave similarly.
The femsmoke data records female smokers and non-smokers, their age group, and whether the subjects were dead or still alive after 20 years.
femsmoke <- as_tibble(femsmoke) %>%
print(n = Inf)
## # A tibble: 28 x 4
## y smoker dead age
## <dbl> <fct> <fct> <fct>
## 1 2 yes yes 18-24
## 2 1 no yes 18-24
## 3 3 yes yes 25-34
## 4 5 no yes 25-34
## 5 14 yes yes 35-44
## 6 7 no yes 35-44
## 7 27 yes yes 45-54
## 8 12 no yes 45-54
## 9 51 yes yes 55-64
## 10 40 no yes 55-64
## 11 29 yes yes 65-74
## 12 101 no yes 65-74
## 13 13 yes yes 75+
## 14 64 no yes 75+
## 15 53 yes no 18-24
## 16 61 no no 18-24
## 17 121 yes no 25-34
## 18 152 no no 25-34
## 19 95 yes no 35-44
## 20 114 no no 35-44
## 21 103 yes no 45-54
## 22 66 no no 45-54
## 23 64 yes no 55-64
## 24 81 no no 55-64
## 25 7 yes no 65-74
## 26 28 no no 65-74
## 27 0 yes no 75+
## 28 0 no no 75+
There are three factors: smoker, dead, and age. If we just classify over smoker and dead,
(ct <- xtabs(y ~ smoker + dead, data = femsmoke))
## dead
## smoker yes no
## yes 139 443
## no 230 502
prop.table(ct, 1)
## dead
## smoker yes no
## yes 0.2388316 0.7611684
## no 0.3142077 0.6857923
we see that a significantly higher percentage of non-smokers died than smokers.
summary(ct)
## Call: xtabs(formula = y ~ smoker + dead, data = femsmoke)
## Number of cases in table: 1314
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 9.121, df = 1, p-value = 0.002527
This is because smoking status is confounded with the age group: smokers are more concentrated in the younger groups, and younger people are more likely to live for another 20 years.
prop.table(xtabs(y ~ smoker + age, data = femsmoke), 2)
## age
## smoker 18-24 25-34 35-44 45-54 55-64 65-74 75+
## yes 0.4700855 0.4412811 0.4739130 0.6250000 0.4872881 0.2181818 0.1688312
## no 0.5299145 0.5587189 0.5260870 0.3750000 0.5127119 0.7818182 0.8311688
ct3 <- xtabs(y ~ smoker + dead + age, data = femsmoke)
apply(ct3, 3, function (x) (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1]))
## 18-24 25-34 35-44 45-54 55-64 65-74 75+
## 2.301887 0.753719 2.400000 1.441748 1.613672 1.148515 NaN
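The 75+ stratum yields NaN because no one in that age group was still alive (both dead = no counts are 0). A common remedy, added here as a sketch, is the Haldane-Anscombe correction of adding 0.5 to every cell:
# age-specific odds ratios after adding 0.5 to each cell to avoid 0/0
apply(ct3, 3, function(x)
  ((x[1, 1] + 0.5) * (x[2, 2] + 0.5)) / ((x[1, 2] + 0.5) * (x[2, 1] + 0.5)))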
mantelhaen.test(ct3, exact = TRUE)
##
## Exact conditional test of independence in 2 x 2 x k tables
##
## data: ct3
## S = 139, p-value = 0.01591
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.068889 2.203415
## sample estimates:
## common odds ratio
## 1.530256
Under mutual independence, \[ p_{ijk} = p_i p_j p_k \] so \[ \mathbb{E} Y_{ijk} = n p_{ijk} \] and \[ \log \mathbb{E} Y_{ijk} = \log n + \log p_i + \log p_j + \log p_k. \] So the main-effects-only model corresponds to the mutual independence model.
We can test independence by the Pearson \(X^2\) test
summary(ct3)
## Call: xtabs(formula = y ~ smoker + dead + age, data = femsmoke)
## Number of cases in table: 1314
## Number of factors: 3
## Test for independence of all factors:
## Chisq = 790.6, df = 19, p-value = 2.14e-155
or by a Poisson GLM
modi <- glm(y ~ smoker + dead + age, family = poisson, data = femsmoke)
summary(modi)
##
## Call:
## glm(formula = y ~ smoker + dead + age, family = poisson, data = femsmoke)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7.9306 -5.3175 -0.5514 2.4229 11.1895
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.67778 0.10702 25.021 < 2e-16 ***
## smokerno 0.22931 0.05554 4.129 3.64e-05 ***
## deadno 0.94039 0.06139 15.319 < 2e-16 ***
## age25-34 0.87618 0.11003 7.963 1.67e-15 ***
## age35-44 0.67591 0.11356 5.952 2.65e-09 ***
## age45-54 0.57536 0.11556 4.979 6.40e-07 ***
## age55-64 0.70166 0.11307 6.206 5.45e-10 ***
## age65-74 0.34377 0.12086 2.844 0.00445 **
## age75+ -0.41837 0.14674 -2.851 0.00436 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1193.9 on 27 degrees of freedom
## Residual deviance: 735.0 on 19 degrees of freedom
## AIC: 887.2
##
## Number of Fisher Scoring iterations: 6
Both suggest a lack of fit.
modi %>% tbl_regression(exponentiate = TRUE, intercept = TRUE)
Characteristic | IRR | 95% CI | p-value
---|---|---|---
(Intercept) | 14.6 | 11.7, 17.9 | <0.001
smoker | | |
yes | — | — |
no | 1.26 | 1.13, 1.40 | <0.001
dead | | |
yes | — | — |
no | 2.56 | 2.27, 2.89 | <0.001
age | | |
18-24 | — | — |
25-34 | 2.40 | 1.94, 2.99 | <0.001
35-44 | 1.97 | 1.58, 2.46 | <0.001
45-54 | 1.78 | 1.42, 2.24 | <0.001
55-64 | 2.02 | 1.62, 2.53 | <0.001
65-74 | 1.41 | 1.11, 1.79 | 0.004
75+ | 0.66 | 0.49, 0.88 | 0.004

IRR = Incidence Rate Ratio, CI = Confidence Interval
Note that the main-effects model reproduces the marginal proportions; for example, normalizing the smoker IRRs (1 for yes, 1.26 for no) recovers the marginal proportions of smokers and non-smokers:
c(1, 1.26) / (1 + 1.26)
## [1] 0.4424779 0.5575221
prop.table(xtabs(y ~ smoker, femsmoke))
## smoker
## yes no
## 0.4429224 0.5570776
If we assume smoker and dead are dependent on each other but jointly independent of age, then \[ p_{ijk} = p_{ij} p_k \] and \[ \log \mathbb{E} Y_{ijk} = \log n + \log p_{ij} + \log p_k. \] It leads to a log-linear model with just one interaction
glm(y ~ smoker * dead + age, family = poisson, data = femsmoke) %>%
summary()
##
## Call:
## glm(formula = y ~ smoker * dead + age, family = poisson, data = femsmoke)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.2606 -5.1564 -0.5933 2.5373 10.4236
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.51582 0.12239 20.555 < 2e-16 ***
## smokerno 0.50361 0.10743 4.688 2.76e-06 ***
## deadno 1.15910 0.09722 11.922 < 2e-16 ***
## age25-34 0.87618 0.11003 7.963 1.67e-15 ***
## age35-44 0.67591 0.11356 5.952 2.65e-09 ***
## age45-54 0.57536 0.11556 4.979 6.40e-07 ***
## age55-64 0.70166 0.11307 6.206 5.45e-10 ***
## age65-74 0.34377 0.12086 2.844 0.00445 **
## age75+ -0.41837 0.14674 -2.851 0.00436 **
## smokerno:deadno -0.37858 0.12566 -3.013 0.00259 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1193.9 on 27 degrees of freedom
## Residual deviance: 725.8 on 18 degrees of freedom
## AIC: 880
##
## Number of Fisher Scoring iterations: 6
It improves the fit, but still not enough.
Let \(p_{ij \mid k}\) be the probability that an observation falls in the \((i, j)\)-cell given that the third variable takes value \(k\). If we assume the first two variables are independent given the value of the third, then \[ p_{ijk} = p_{ij\mid k} p_k = p_{i\mid k} p_{j \mid k} p_k = \frac{p_{ik} p_{jk}}{p_k} \] and \[ \log \mathbb{E} Y_{ijk} = \log n + \log p_{ik} + \log p_{jk} - \log p_k. \]
It leads to a model with two interaction terms
glm(y ~ smoker * age + dead * age, family = poisson, data = femsmoke) %>%
summary()
##
## Call:
## glm(formula = y ~ smoker * age + dead * age, family = poisson,
## data = femsmoke)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.30657 -0.26480 -0.00003 0.26643 1.20822
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.34377 0.58563 0.587 0.557199
## smokerno 0.11980 0.18523 0.647 0.517785
## age25-34 0.91760 0.68737 1.335 0.181895
## age35-44 1.95402 0.62882 3.107 0.001887 **
## age45-54 2.84979 0.60950 4.676 2.93e-06 ***
## age55-64 3.44819 0.59868 5.760 8.43e-09 ***
## age65-74 3.00134 0.61023 4.918 8.73e-07 ***
## age75+ 2.22118 0.64799 3.428 0.000609 ***
## deadno 3.63759 0.58490 6.219 5.00e-10 ***
## smokerno:age25-34 0.11616 0.22078 0.526 0.598789
## smokerno:age35-44 -0.01536 0.22749 -0.068 0.946172
## smokerno:age45-54 -0.63063 0.23414 -2.693 0.007074 **
## smokerno:age55-64 -0.06894 0.22643 -0.304 0.760765
## smokerno:age65-74 1.15649 0.26427 4.376 1.21e-05 ***
## smokerno:age75+ 1.47413 0.35617 4.139 3.49e-05 ***
## age25-34:deadno -0.10756 0.68613 -0.157 0.875435
## age35-44:deadno -1.33977 0.62810 -2.133 0.032920 *
## age45-54:deadno -2.17125 0.61128 -3.552 0.000382 ***
## age55-64:deadno -3.17171 0.59999 -5.286 1.25e-07 ***
## age65-74:deadno -4.94977 0.61512 -8.047 8.49e-16 ***
## age75+:deadno -26.30450 5776.51889 -0.005 0.996367
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1193.9378 on 27 degrees of freedom
## Residual deviance: 8.3269 on 7 degrees of freedom
## AIC: 184.52
##
## Number of Fisher Scoring iterations: 17
This model fits the data pretty well.
glm(y ~ smoker * age + dead * age, family = poisson, data = femsmoke) %>%
drop1(test = "Chi")
## Single term deletions
##
## Model:
## y ~ smoker * age + dead * age
## Df Deviance AIC LRT Pr(>Chi)
## <none> 8.33 184.52
## smoker:age 6 101.83 266.03 93.51 < 2.2e-16 ***
## age:dead 6 641.50 805.69 633.17 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Next we fit the model with all two-way interactions:
modu <- glm(y ~ (smoker + age + dead)^2, family = poisson, data = femsmoke)
summary(modu)
##
## Call:
## glm(formula = y ~ (smoker + age + dead)^2, family = poisson,
## data = femsmoke)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.70006 -0.11004 -0.00002 0.12254 0.67272
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.54284 0.58736 0.924 0.355384
## smokerno -0.29666 0.25324 -1.171 0.241401
## age25-34 0.92902 0.68381 1.359 0.174273
## age35-44 1.94048 0.62486 3.105 0.001900 **
## age45-54 2.76845 0.60657 4.564 5.02e-06 ***
## age55-64 3.37507 0.59550 5.668 1.45e-08 ***
## age65-74 2.86586 0.60894 4.706 2.52e-06 ***
## age75+ 2.02211 0.64955 3.113 0.001851 **
## deadno 3.43271 0.59014 5.817 6.00e-09 ***
## smokerno:age25-34 0.11752 0.22091 0.532 0.594749
## smokerno:age35-44 0.01268 0.22800 0.056 0.955654
## smokerno:age45-54 -0.56538 0.23585 -2.397 0.016522 *
## smokerno:age55-64 0.08512 0.23573 0.361 0.718030
## smokerno:age65-74 1.49088 0.30039 4.963 6.93e-07 ***
## smokerno:age75+ 1.89060 0.39582 4.776 1.78e-06 ***
## smokerno:deadno 0.42741 0.17703 2.414 0.015762 *
## age25-34:deadno -0.12006 0.68655 -0.175 0.861178
## age35-44:deadno -1.34112 0.62857 -2.134 0.032874 *
## age45-54:deadno -2.11336 0.61210 -3.453 0.000555 ***
## age55-64:deadno -3.18077 0.60057 -5.296 1.18e-07 ***
## age65-74:deadno -5.08798 0.61951 -8.213 < 2e-16 ***
## age75+:deadno -27.31727 8839.01146 -0.003 0.997534
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1193.9378 on 27 degrees of freedom
## Residual deviance: 2.3809 on 6 degrees of freedom
## AIC: 180.58
##
## Number of Fisher Scoring iterations: 18
xtabs(fitted(modu) ~ smoker + dead + age, data = femsmoke) %>%
apply(3, function(x) (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1]))
## 18-24 25-34 35-44 45-54 55-64 65-74 75+
## 1.533275 1.533275 1.533275 1.533275 1.533275 1.533275 1.533275
The fitted odds ratio between smoker and dead is the same in every age group, hence the name uniform association model. It equals the exponentiated interaction coefficient:
exp(coef(modu)['smokerno:deadno'])
## smokerno:deadno
## 1.533275
We can also search for a parsimonious model by backward elimination from the saturated model:
glm(y ~ smoker * age * dead, family = poisson, data = femsmoke) %>%
step(test = "Chi") %>%
drop1(test = "Chi")
## Start: AIC=190.19
## y ~ smoker * age * dead
##
## Df Deviance AIC LRT Pr(>Chi)
## - smoker:age:dead 6 2.3809 180.58 2.3809 0.8815
## <none> 0.0000 190.19
##
## Step: AIC=180.58
## y ~ smoker + age + dead + smoker:age + smoker:dead + age:dead
##
## Df Deviance AIC LRT Pr(>Chi)
## <none> 2.38 180.58
## - smoker:dead 1 8.33 184.52 5.95 0.01475 *
## - smoker:age 6 92.63 258.83 90.25 < 2e-16 ***
## - age:dead 6 632.30 798.49 629.92 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Single term deletions
##
## Model:
## y ~ smoker + age + dead + smoker:age + smoker:dead + age:dead
## Df Deviance AIC LRT Pr(>Chi)
## <none> 2.38 180.58
## smoker:age 6 92.63 258.83 90.25 < 2e-16 ***
## smoker:dead 1 8.33 184.52 5.95 0.01475 *
## age:dead 6 632.30 798.49 629.92 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction term smoker:dead is weakly significant in the final model. This corresponds to a test of conditional independence between smoker and dead given the age group. The p-value is very similar to that from the CMH test.
We can also fit a binomial model with survival status as the response and smoker and age as predictors.
# equivalent formulation using case weights on the ungrouped counts:
# glm(dead ~ smoker * age, family = binomial, weights = y, data = femsmoke) %>%
#   summary()
glm(matrix(femsmoke$y, ncol = 2) ~ smoker * age, family = binomial, femsmoke[1:14, ]) %>%
step(test = "Chi") %>%
summary()
## Start: AIC=75
## matrix(femsmoke$y, ncol = 2) ~ smoker * age
##
## Df Deviance AIC LRT Pr(>Chi)
## - smoker:age 6 2.3809 65.377 2.3809 0.8815
## <none> 0.0000 74.996
##
## Step: AIC=65.38
## matrix(femsmoke$y, ncol = 2) ~ smoker + age
##
## Df Deviance AIC LRT Pr(>Chi)
## <none> 2.38 65.38
## - smoker 1 8.33 69.32 5.95 0.01475 *
## - age 6 632.30 683.29 629.92 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## glm(formula = matrix(femsmoke$y, ncol = 2) ~ smoker + age, family = binomial,
## data = femsmoke[1:14, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.72545 -0.22836 0.00005 0.19146 0.68162
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.4327 0.5901 -5.817 6.00e-09 ***
## smokerno -0.4274 0.1770 -2.414 0.015762 *
## age25-34 0.1201 0.6865 0.175 0.861178
## age35-44 1.3411 0.6286 2.134 0.032874 *
## age45-54 2.1134 0.6121 3.453 0.000555 ***
## age55-64 3.1808 0.6006 5.296 1.18e-07 ***
## age65-74 5.0880 0.6195 8.213 < 2e-16 ***
## age75+ 27.8073 11293.1430 0.002 0.998035
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 641.4963 on 13 degrees of freedom
## Residual deviance: 2.3809 on 6 degrees of freedom
## AIC: 65.377
##
## Number of Fisher Scoring iterations: 20
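Note the residual deviance (2.3809 on 6 degrees of freedom) matches that of the Poisson uniform association model modu, reflecting the binomial-Poisson equivalence. A quick added check, assuming modu from above is still in scope:
modb <- glm(matrix(femsmoke$y, ncol = 2) ~ smoker + age,
            family = binomial, data = femsmoke[1:14, ])
deviance(modb)  # 2.3809, identical to the Poisson fit
deviance(modu)  # 2.3809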