Multinomial Data (ELMR Chapter 7)

Multinomial logit model
- 1996 election data
- Fit multinomial-logit
Hierarchical or nested responses
Ordinal multinomial responses

Display system information and load tidyverse and faraway packages

sessionInfo()

## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.2  magrittr_1.5    tools_3.6.2     htmltools_0.4.0
##  [5] yaml_2.2.1      Rcpp_1.0.4.6    stringi_1.4.6   rmarkdown_2.1  
##  [9] knitr_1.28      stringr_1.4.0   xfun_0.13       digest_0.6.25  
## [13] rlang_0.4.5     evaluate_0.14

library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ──────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(faraway)

faraway package contains the datasets in the ELMR book.

Multinomial logit model

Multinomial regression is a natural extension of the binomial regression to the case when responses are multivariate counts. Let $Y_{ij}$ be the number of observations falling into category $j$ for batch $i$ and let $n_i = \sum_j Y_{ij}$ be the batch size. Then \[ \mathbb{P}(Y_{i1} = y_{i1}, \ldots, Y_{iJ} = y_{iJ}) = \binom{n_i}{y_{i1} \cdots y_{iJ}} p_{i1}^{y_{i1}} \cdots p_{iJ}^{y_{iJ}}. \]
We distinguish between nominal multinomial data where there is no natural order to the categories and ordinal multinomial data where there is an order. The multinomial logit model is indended for nominal data. It can be used for ordinal data, but the informtion about order will not be used.
We use the same idea as the logistic regression. \[ \eta_{ij} = \mathbf{x}_i^T \boldsymbol{\beta}_j = \log \frac{p_{ij}}{p_{i1}}, \quad j = 2,\ldots,J. \] We need to obey the constraint that $\sum_{i=1}^J p_{ij} = 1$ so it is convenient to declare one of the categories as the baseline say $j=1$. Then we set $p_{i1} = 1 - \sum_{j=2}^J p_{ij}$ and have \[ p_{ij} = \frac{\exp(\eta_{ij})}{1 + \sum_{j=2}^J \exp(\eta_{ij})}. \]
Suppose the length of $\mathbf{x}_i$ is $p$, then the multinomial-logit model consumes $p(J-1)$ parameters.
We may estimate the parameters using MLE and the standard methods for inference.

1996 election data

Consider an example of the 1996 American National Election Study. We consider only the education level and income group of the respondents. We will study how party identification of the respondent (Democrat, Independent or Republican) depends on predictors age, education level, and income.

(nes96 <- as_tibble(nes96))

## # A tibble: 944 x 10
##    popul TVnews selfLR ClinLR DoleLR PID       age educ  income   vote   
##    <int>  <int> <ord>  <ord>  <ord>  <ord>   <int> <ord> <ord>    <fct>  
##  1     0      7 extCon extLib Con    strRep     36 HS    $3Kminus Dole   
##  2   190      1 sliLib sliLib sliCon weakDem    20 Coll  $3Kminus Clinton
##  3    31      7 Lib    Lib    Con    weakDem    24 BAdeg $3Kminus Clinton
##  4    83      4 sliLib Mod    sliCon weakDem    28 BAdeg $3Kminus Clinton
##  5   640      7 sliCon Con    Mod    strDem     68 BAdeg $3Kminus Clinton
##  6   110      3 sliLib Mod    Con    weakDem    21 Coll  $3Kminus Clinton
##  7   100      7 sliCon Con    Mod    weakDem    77 Coll  $3Kminus Clinton
##  8    31      1 sliCon Mod    sliCon indRep     21 Coll  $3Kminus Clinton
##  9   180      7 Mod    Con    sliLib indind     31 Coll  $3Kminus Clinton
## 10  2800      0 sliLib sliLib extCon strDem     39 HS    $3Kminus Clinton
## # … with 934 more rows

Numerical summaries:

summary(nes96)

##      popul            TVnews         selfLR       ClinLR       DoleLR   
##  Min.   :   0.0   Min.   :0.000   extLib: 16   extLib:109   extLib: 13  
##  1st Qu.:   1.0   1st Qu.:1.000   Lib   :103   Lib   :317   Lib   : 31  
##  Median :  22.0   Median :3.000   sliLib:147   sliLib:236   sliLib: 43  
##  Mean   : 306.4   Mean   :3.728   Mod   :256   Mod   :160   Mod   : 87  
##  3rd Qu.: 110.0   3rd Qu.:7.000   sliCon:170   sliCon: 67   sliCon:195  
##  Max.   :7300.0   Max.   :7.000   Con   :218   Con   : 36   Con   :460  
##                                   extCon: 34   extCon: 19   extCon:115  
##       PID           age            educ           income         vote    
##  strDem :200   Min.   :19.00   MS    : 13   $60K-$75K:103   Clinton:551  
##  weakDem:180   1st Qu.:34.00   HSdrop: 52   $50K-$60K:100   Dole   :393  
##  indDem :108   Median :44.00   HS    :248   $30K-$35K: 70                
##  indind : 37   Mean   :47.04   Coll  :187   $25K-$30K: 68                
##  indRep : 94   3rd Qu.:58.00   CCdeg : 90   $105Kplus: 68                
##  weakRep:150   Max.   :91.00   BAdeg :227   $35K-$40K: 62                
##  strRep :175                   MAdeg :127   (Other)  :473

The original variable PID (party identification) has ordered categories strDem (strong democratic), weakDem (weak democratic), indDem (independent democratic), indind (independent), indRep (independent republican), weakRep (weak Republican), and strRep (strong republican). In this analysis we only consider the coarsened categories: Democratic, Independent, and Republican.

nes96 <- nes96 %>%
  mutate(sPID = recode(PID, 
                       strDem  = "Democrat", 
                       weakDem = "Democrat", 
                       indDem  = "Independent", 
                       indind  = "Independent", 
                       indRep  = "Independent", 
                       weakRep = "Republican", 
                       strRep  = "Republican"))
summary(nes96$sPID)

##    Democrat Independent  Republican 
##         380         239         325

The income variable in the original data was an ordered factor with income ranges. We convert this to a numeric variable by taking the midpoint of each range.

table(nes96$income)

## 
##   $3Kminus    $3K-$5K    $5K-$7K    $7K-$9K   $9K-$10K  $10K-$11K  $11K-$12K 
##         19         12         17         19         18         13         11 
##  $12K-$13K  $13K-$14K  $14K-$15K  $15K-$17K  $17K-$20K  $20K-$22K  $22K-$25K 
##         17         10         15         23         35         26         39 
##  $25K-$30K  $30K-$35K  $35K-$40K  $40K-$45K  $45K-$50K  $50K-$60K  $60K-$75K 
##         68         70         62         48         51        100        103 
##  $75K-$90K $90K-$105K  $105Kplus 
##         53         47         68

inca <- c(1.5, 4, 6, 8, 9.5, 10.5, 11.5, 12.5, 13.5, 
          14.5, 16, 18.5, 21, 23.5, 27.5, 32.5, 37.5, 
          42.5, 47.5, 55, 67.5, 82.5, 97.5, 115)
nes96 <- nes96 %>%
  mutate(nincome = inca[unclass(income)])
summary(nes96$nincome)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.50   23.50   37.50   46.58   67.50  115.00

Graphical summaries.

How does party identification changes with education level?

nes96 %>%
  count(educ, sPID) %>%
  group_by(educ) %>%
  mutate(prop = prop.table(n)) %>%
  print(width = Inf) %>%
  ggplot() +
  geom_line(mapping = aes(x = educ, y = prop, group = sPID, color = sPID)) + 
  scale_color_manual(values = c("blue", "green", "red")) + 
  labs(x = "Education", y = "Proportion")

## # A tibble: 21 x 4
## # Groups:   educ [7]
##    educ   sPID            n   prop
##    <ord>  <ord>       <int>  <dbl>
##  1 MS     Democrat        9 0.692 
##  2 MS     Independent     3 0.231 
##  3 MS     Republican      1 0.0769
##  4 HSdrop Democrat       29 0.558 
##  5 HSdrop Independent    14 0.269 
##  6 HSdrop Republican      9 0.173 
##  7 HS     Democrat      108 0.435 
##  8 HS     Independent    63 0.254 
##  9 HS     Republican     77 0.310 
## 10 Coll   Democrat       74 0.396 
## # … with 11 more rows

How does party identification changes with income level?

nes96 %>%
  count(income, sPID) %>%
  group_by(income) %>%
  mutate(prop = prop.table(n)) %>%
  print(width = Inf) %>%
  ggplot() +
  geom_line(mapping = aes(x = income, y = prop, group = sPID, color = sPID)) + 
  scale_color_manual(values = c("blue", "green", "red")) + 
  labs(x = "Income", y = "Proportion") + 
  theme(axis.text.x = element_text(angle = 90))

## # A tibble: 72 x 4
## # Groups:   income [24]
##    income   sPID            n   prop
##    <ord>    <ord>       <int>  <dbl>
##  1 $3Kminus Democrat       13 0.684 
##  2 $3Kminus Independent     3 0.158 
##  3 $3Kminus Republican      3 0.158 
##  4 $3K-$5K  Democrat        7 0.583 
##  5 $3K-$5K  Independent     4 0.333 
##  6 $3K-$5K  Republican      1 0.0833
##  7 $5K-$7K  Democrat        9 0.529 
##  8 $5K-$7K  Independent     3 0.176 
##  9 $5K-$7K  Republican      5 0.294 
## 10 $7K-$9K  Democrat       11 0.579 
## # … with 62 more rows

How does party identification changes with age?

nes96 %>%
  mutate(age_grp = cut(age, breaks = seq(from = 10, to = 100, by = 10))) %>%
  count(age_grp, sPID) %>%
  group_by(age_grp) %>%
  mutate(prop = prop.table(n)) %>%
  print(width = Inf) %>%
  ggplot() +
  geom_line(mapping = aes(x = age_grp, y = prop, group = sPID, color = sPID)) + 
  scale_color_manual(values = c("blue", "green", "red")) + 
  labs(x = "Age", y = "Proportion") + 
  theme(axis.text.x = element_text(angle = 90))

## # A tibble: 26 x 4
## # Groups:   age_grp [9]
##    age_grp sPID            n  prop
##    <fct>   <ord>       <int> <dbl>
##  1 (10,20] Democrat        3 0.333
##  2 (10,20] Independent     2 0.222
##  3 (10,20] Republican      4 0.444
##  4 (20,30] Democrat       60 0.438
##  5 (20,30] Independent    37 0.270
##  6 (20,30] Republican     40 0.292
##  7 (30,40] Democrat       87 0.348
##  8 (30,40] Independent    65 0.26 
##  9 (30,40] Republican     98 0.392
## 10 (40,50] Democrat       89 0.441
## # … with 16 more rows

Fit multinomial-logit

To verify that the trends observed in these graphs are statistical significant, we need a formal model. The multinom function is avaible in the nnet package.

library(nnet)
mmod <- multinom(sPID ~ age + educ + nincome, data = nes96)

## # weights:  30 (18 variable)
## initial  value 1037.090001 
## iter  10 value 990.568608
## iter  20 value 984.319052
## final  value 984.166272 
## converged

summary(mmod)

## Call:
## multinom(formula = sPID ~ age + educ + nincome, data = nes96)
## 
## Coefficients:
##             (Intercept)          age     educ.L     educ.Q    educ.C
## Independent   -1.197260 0.0001534525 0.06351451 -0.1217038 0.1119542
## Republican    -1.642656 0.0081943691 1.19413345 -1.2292869 0.1544575
##                  educ^4     educ^5      educ^6    nincome
## Independent -0.07657336  0.1360851  0.15427826 0.01623911
## Republican  -0.02827297 -0.1221176 -0.03741389 0.01724679
## 
## Std. Errors:
##             (Intercept)         age    educ.L    educ.Q    educ.C    educ^4
## Independent   0.3265951 0.005374592 0.4571884 0.4142859 0.3498491 0.2883031
## Republican    0.3312877 0.004902668 0.6502670 0.6041924 0.4866432 0.3605620
##                educ^5    educ^6     nincome
## Independent 0.2494706 0.2171578 0.003108585
## Republican  0.2696036 0.2031859 0.002881745
## 
## Residual Deviance: 1968.333 
## AIC: 2004.333

We can select which variables to include in the model based the AIC criterion using a stepwise search method.

step(mmod)

## Start:  AIC=2004.33
## sPID ~ age + educ + nincome
## 
## trying - age 
## # weights:  27 (16 variable)
## initial  value 1037.090001 
## iter  10 value 988.896864
## iter  20 value 985.822223
## final  value 985.812737 
## converged
## trying - educ 
## # weights:  12 (6 variable)
## initial  value 1037.090001 
## iter  10 value 992.269502
## final  value 992.269484 
## converged
## trying - nincome 
## # weights:  27 (16 variable)
## initial  value 1037.090001 
## iter  10 value 1009.025560
## iter  20 value 1006.961593
## final  value 1006.955275 
## converged
##           Df      AIC
## - educ     6 1996.539
## - age     16 2003.625
## <none>    18 2004.333
## - nincome 16 2045.911
## # weights:  12 (6 variable)
## initial  value 1037.090001 
## iter  10 value 992.269502
## final  value 992.269484 
## converged
## 
## Step:  AIC=1996.54
## sPID ~ age + nincome
## 
## trying - age 
## # weights:  9 (4 variable)
## initial  value 1037.090001 
## final  value 992.712152 
## converged
## trying - nincome 
## # weights:  9 (4 variable)
## initial  value 1037.090001 
## final  value 1020.425203 
## converged
##           Df      AIC
## - age      4 1993.424
## <none>     6 1996.539
## - nincome  4 2048.850
## # weights:  9 (4 variable)
## initial  value 1037.090001 
## final  value 992.712152 
## converged
## 
## Step:  AIC=1993.42
## sPID ~ nincome
## 
## trying - nincome 
## # weights:  6 (2 variable)
## initial  value 1037.090001 
## final  value 1020.636052 
## converged
##           Df      AIC
## <none>     4 1993.424
## - nincome  2 2045.272

## Call:
## multinom(formula = sPID ~ nincome, data = nes96)
## 
## Coefficients:
##             (Intercept)    nincome
## Independent  -1.1749331 0.01608683
## Republican   -0.9503591 0.01766457
## 
## Residual Deviance: 1985.424 
## AIC: 1993.424

Or we can use the standard likelihood methods to derive a test to compare nested models. For example, we can fit a model with predictor nincome only and then compare the deviances.

mmodi <- multinom(sPID ~ nincome, data = nes96)

## # weights:  9 (4 variable)
## initial  value 1037.090001 
## final  value 992.712152 
## converged

deviance(mmodi) - deviance(mmod)

## [1] 17.09176

mmod$edf - mmodi$edf

## [1] 14

pchisq(deviance(mmodi) - deviance(mmod), mmod$edf - mmodi$edf, lower = F)

## [1] 0.2513209

To interpret the coefficients:

summary(mmodi)

## Call:
## multinom(formula = sPID ~ nincome, data = nes96)
## 
## Coefficients:
##             (Intercept)    nincome
## Independent  -1.1749331 0.01608683
## Republican   -0.9503591 0.01766457
## 
## Std. Errors:
##             (Intercept)     nincome
## Independent   0.1536103 0.002849738
## Republican    0.1416859 0.002652532
## 
## Residual Deviance: 1985.424 
## AIC: 1993.424

- The intercept terms model the probabilities of party identification when income is zero.

cc <- c(0, -1.17493, -0.95036)
exp(cc) / sum(exp(cc))

## [1] 0.5898166 0.1821593 0.2280242

- The slope terms represent the log-odds of moving from the baseline category `Democrat` to `Independent` and `Republican` respectively for a unit change of $1000 in income.

(pp <- predict(mmodi, data.frame(nincome = c(0, 1)), type = "probs"))

##    Democrat Independent Republican
## 1 0.5898168   0.1821588  0.2280244
## 2 0.5857064   0.1838228  0.2304708

log(pp[1, 1] * pp[2, 2] / (pp[1, 2] * pp[2, 1]))

## [1] 0.01608683

log(pp[1, 1] * pp[2, 3] / (pp[1, 3] * pp[2, 1]))

## [1] 0.01766457

We can obtain predicted values for specified values of income.

il <- c(8, 26, 42, 58, 74, 90, 107)
predict(mmodi, data.frame(nincome = il), type = "probs") %>%
  as_tibble() %>%
  mutate(income = il) %>%
  pivot_longer(Democrat:Republican, names_to = "sPID", values_to = "prob") %>%
  ggplot() + 
  geom_line(mapping = aes(x = income, y = prob, group = sPID, color = sPID)) +
  scale_color_manual(values = c("blue", "green", "red")) + 
  labs(x = "Income", y = "Probability")

Hierarchical or nested responses

Consider following data concerning live births with deformations of the central nervous system (CNS) in south Wales.

(cns <- as_tibble(cns))

## # A tibble: 16 x 7
##    Area          NoCNS    An    Sp Other Water Work     
##    <fct>         <int> <int> <int> <int> <int> <fct>    
##  1 Cardiff        4091     5     9     5   110 NonManual
##  2 Newport        1515     1     7     0   100 NonManual
##  3 Swansea        2394     9     5     0    95 NonManual
##  4 GlamorganE     3163     9    14     3    42 NonManual
##  5 GlamorganW     1979     5    10     1    39 NonManual
##  6 GlamorganC     4838    11    12     2   161 NonManual
##  7 MonmouthV      2362     6     8     4    83 NonManual
##  8 MonmouthOther  1604     3     6     0   122 NonManual
##  9 Cardiff        9424    31    33    14   110 Manual   
## 10 Newport        4610     3    15     6   100 Manual   
## 11 Swansea        5526    19    30     4    95 Manual   
## 12 GlamorganE    13217    55    71    19    42 Manual   
## 13 GlamorganW     8195    30    44    10    39 Manual   
## 14 GlamorganC     7803    25    28    12   161 Manual   
## 15 MonmouthV      9962    36    37    13    83 Manual   
## 16 MonmouthOther  3172     8    13     3   122 Manual

Responses:
- NoCNS: no CNS
- An: anencephalus
- Sp: sina bifida
- Other: other malformations
Predictors:
- Water: water hardness
- Work: type of work performed by the parents
We might consider a multinomial response with four categories. However, we can see that most births suffer no malformation and so this category dominates the other three. It is better to consider this as a hierarchical response. Consider the multinomial likelihood for the $i$-th observation which is proportional to: \[ p_{i1}^{y_{i1}} p_{i2}^{y_{i2}} p_{i3}^{y_{i3}} p_{i4}^{y_{i4}} \] Define $p_{ic} = p_{i2} + p_{i3} + p_{i4}$ which is the probability of a birth with some kind of CNS malformation. We can then write the likelihood as \[ p_{i1}^{y_{i1}} p_{ic}^{y_{i2} + y_{i3} + y_{i4}} \times \left( \frac{p_{i2}}{p_{ic}} \right)^{y_{i2}} \left( \frac{p_{i3}}{p_{ic}} \right)^{y_{i3}} \left( \frac{p_{i4}}{p_{ic}} \right)^{y_{i4}} \] We can now separatly develop a binomial model for whether malformation occurs and a multinomial model for the type of malformation.

Ordinal multinomial responses

Suppose we have $J$ ordered categories and $p_{ij} = \mathbb{P}(Y_i = j)$ for $j=1, \ldots, J$. For ordered responses, it is more convenient to work with the cumulative probabilities \[ \gamma_{ij} = \mathbb{P}(Y_i \le j). \] Since $\gamma_{iJ}=1$, we only need to model $J-1$ cumulative probabilities.
To link $\gamma$s to covariats $x$, we consider \[ g(\gamma_{ij}) = \theta_j - \mathbf{x}_i^T \boldsymbol{\beta}, \] where $\theta_j$ have to be non-decreasing in $j$ to honor the ordering. In other words, $\boldsymbol{\beta}$ has a uniform effect on the response categories and each category has its own intercept. Again we have at least 3 choices for the link function: logit, probit or cloglog.
Latent variable interpretation.

theta <- c(-2, -1, 2)
tibble(t = seq(-6, 6, 0.1), pdf = dlogis(t, location = 0, scale = 1)) %>%
  ggplot() + 
  geom_line(mapping = aes(x = t, y = pdf)) + 
  geom_segment(mapping = aes(x = theta[1], y = 0, xend = theta[1], yend = dlogis(theta[1]))) + 
  geom_segment(mapping = aes(x = theta[2], y = 0, xend = theta[2], yend = dlogis(theta[2]))) + 
  geom_segment(mapping = aes(x = theta[3], y = 0, xend = theta[3], yend = dlogis(theta[3]))) +
  labs(x = "t", y = "Density") + 
  annotate("text", x = theta[1], y = 0, label = expression(theta[1])) + 
  annotate("text", x = theta[2], y = 0, label = expression(theta[2])) + 
  annotate("text", x = theta[3], y = 0, label = expression(theta[3]))

## Warning in is.na(x): is.na() applied to non-(list or vector) of type
## 'expression'

## Warning in is.na(x): is.na() applied to non-(list or vector) of type
## 'expression'

## Warning in is.na(x): is.na() applied to non-(list or vector) of type
## 'expression'

Proportional odds model

The logit link function dictates \[ \log \frac{\gamma_{ij}}{1 - \gamma_{ij}} = \theta_j - \mathbf{x}_i^T \boldsymbol{\beta}, \quad j = 1,\ldots,J-1. \]
The corresponding inverse link function is \[ \gamma_{ij} = \frac{\exp (\theta_j - \mathbf{x}_i^T \boldsymbol{\beta})}{1 + \exp (\theta_j - \mathbf{x}_i^T \boldsymbol{\beta})}. \] When $\beta_j > 0$, as $x_{ij}$ increases, $\gamma_{ij} = \mathbb{P}(Y_i \le j)$ decreases the same amount for all $j < J$, thus $Y_i$ is more likely to take the largest value $J$. This motivates the minus sign in the definition of the model since it allows easier interpretation of $\beta$.
It is called the proportional odds model because the relative odds for $y \le j$ comparing $\mathbf{x}_1$ and $\mathbf{x}_2$ are \[ \left( \frac{\gamma_j(\mathbf{x}_1)}{1 - \gamma_j(\mathbf{x}_1)} \right) / \left( \frac{\gamma_j(\mathbf{x}_2)}{1 - \gamma_j(\mathbf{x}_2)} \right) = \exp (- (\mathbf{x}_1 - \mathbf{x}_2)^T \boldsymbol{\beta}), \] which do not dependent on $j$. Here $\gamma_j(\mathbf{x}_i) = \gamma_{ij}$.
We can fit a proportional odds model using the polr function from the MASS library

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

pomod <- polr(sPID ~ age + educ + nincome, data = nes96)
summary(pomod)

## 
## Re-fitting to get Hessian

## Call:
## polr(formula = sPID ~ age + educ + nincome, data = nes96)
## 
## Coefficients:
##             Value Std. Error  t value
## age      0.005775   0.003887  1.48581
## educ.L   0.724087   0.384388  1.88374
## educ.Q  -0.781361   0.351173 -2.22500
## educ.C   0.040168   0.291762  0.13767
## educ^4  -0.019925   0.232429 -0.08573
## educ^5  -0.079413   0.191533 -0.41462
## educ^6  -0.061104   0.157747 -0.38735
## nincome  0.012739   0.002140  5.95187
## 
## Intercepts:
##                        Value   Std. Error t value
## Democrat|Independent    0.6449  0.2435     2.6479
## Independent|Republican  1.7374  0.2493     6.9694
## 
## Residual Deviance: 1984.211 
## AIC: 2004.211

Stepwise regression leads to a model with only one predictor nincome.

pomodi <- step(pomod)

## Start:  AIC=2004.21
## sPID ~ age + educ + nincome
## 
##           Df    AIC
## - educ     6 2002.8
## <none>       2004.2
## - age      1 2004.4
## - nincome  1 2038.6
## 
## Step:  AIC=2002.83
## sPID ~ age + nincome
## 
##           Df    AIC
## - age      1 2001.4
## <none>       2002.8
## - nincome  1 2047.2
## 
## Step:  AIC=2001.36
## sPID ~ nincome
## 
##           Df    AIC
## <none>       2001.4
## - nincome  1 2045.3

summary(pomodi)

## 
## Re-fitting to get Hessian

## Call:
## polr(formula = sPID ~ nincome, data = nes96)
## 
## Coefficients:
##           Value Std. Error t value
## nincome 0.01312   0.001971   6.657
## 
## Intercepts:
##                        Value   Std. Error t value
## Democrat|Independent    0.2091  0.1123     1.8627
## Independent|Republican  1.2916  0.1201    10.7526
## 
## Residual Deviance: 1995.363 
## AIC: 2001.363

Analysis of deviance also justifies the model with only nincome.

c(deviance(pomodi), pomodi$edf)

## [1] 1995.363    3.000

which can be compared to the corresponding multinomial logit model

c(deviance(pomod), pomod$edf)

## [1] 1984.211   10.000

We see the proportional odds model is justifiable.

pchisq(deviance(pomodi) - deviance(pomod), 
       pomod$edf - pomodi$edf, 
       lower.tail = FALSE)

## [1] 0.1321517

Interpretation of coefficients.
- The odds of moving from Democratic to Independent or from Independent to Republican increases by a factor of
```
exp(pomodi$coef[1])
```
```
##  nincome 
## 1.013206
```
as income increases by one unit ($1000).
- For income of $0, the predicted probability of being a Democrat is
```
ilogit(pomodi$zeta[1])
```
```
## Democrat|Independent 
##            0.5520865
```
and that of being an Independent is
```
ilogit(pomodi$zeta[2]) - ilogit(pomodi$zeta[1])
```
```
## Independent|Republican 
##              0.2323241
```
and that of being a Republican is
```
1 - ilogit(pomodi$zeta[2])
```
```
## Independent|Republican 
##              0.2155895
```
The predicted probabilities of each category at each income level is

l <- c(8, 26, 42, 58, 74, 90, 107)
predict(pomodi, data.frame(nincome = il), type = "probs")

##    Democrat Independent Republican
## 1 0.5260129   0.2401191  0.2338679
## 2 0.4670450   0.2541588  0.2787962
## 3 0.4153410   0.2617693  0.3228897
## 4 0.3654362   0.2641882  0.3703756
## 5 0.3182635   0.2612285  0.4205080
## 6 0.2745456   0.2531189  0.4723355
## 7 0.2324161   0.2395468  0.5280371

Ordered probit model

If we use the probit link, then \[ \Phi^{-1}(\gamma_j(\mathbf{x}_i)) = \theta_j - \mathbf{x}_i^T \boldsymbol{\beta}, \quad j=1,\ldots,J-1. \]

opmod <- polr(sPID ~ nincome, method = "probit", data = nes96)
summary(opmod)

## 
## Re-fitting to get Hessian

## Call:
## polr(formula = sPID ~ nincome, data = nes96, method = "probit")
## 
## Coefficients:
##            Value Std. Error t value
## nincome 0.008182   0.001208   6.775
## 
## Intercepts:
##                        Value   Std. Error t value
## Democrat|Independent    0.1284  0.0694     1.8510
## Independent|Republican  0.7976  0.0722    11.0399
## 
## Residual Deviance: 1994.892 
## AIC: 2000.892

The deviance is similar to the logit link, but the coefficients appear to be different. The predictions are similar.

l <- c(8, 26, 42, 58, 74, 90, 107)
predict(opmod, data.frame(nincome = il), type = "probs")

##    Democrat Independent Republican
## 1 0.5251020   0.2428478  0.2320502
## 2 0.4664030   0.2542673  0.2793298
## 3 0.4147945   0.2602623  0.3249432
## 4 0.3646178   0.2620370  0.3733452
## 5 0.3166610   0.2595041  0.4238349
## 6 0.2716037   0.2527879  0.4756084
## 7 0.2275119   0.2414351  0.5310530

Proportional hazards model

Suppose we use the cloglog link \[ \log (- \log (1 - \gamma_j(\mathbf{x}_i))) = \theta_j - \mathbf{x}_i^T \boldsymbol{\beta}. \]
The hazard of category $j$ is the probability of falling in category $j$ given that your category is greater than $j$ \[ \text{Hazard}(j) = \mathbb{P}(Y_i = j \mid Y_i \ge j) = \frac{\mathbb{P}(Y_i = j)}{\mathbb{P}(Y_i \ge j)} = \frac{\gamma_{ij} - \gamma_{i,j-1}}{1 - \gamma_{i,j-1}}. \] The quantity $- \log \mathbb{P}(Y > j)$ is called the cumulative hazard function.
Since \[\begin{eqnarray*} 1 - \gamma_j(\mathbf{x}) &=& \mathbb{P}(Y > j) = e^{- e^{\theta_j - \mathbf{x}^T \boldsymbol{\beta}}}, \end{eqnarray*}\] we have \[\begin{eqnarray*} \log \mathbb{P}(Y > j) = - e^{\theta_j - \mathbf{x}^T \boldsymbol{\beta}} \end{eqnarray*}\] and \[ \frac{- \log \mathbb{P}(Y > j \mid \mathbf{x}_1)}{- \log \mathbb{P}(Y > j \mid \mathbf{x}_2)} = e^{(\mathbf{x}_2 - \mathbf{x}_1)^T \boldsymbol{\beta}} \] or \[ \mathbb{P}(Y > j \mid \mathbf{x}_1) = [\mathbb{P}(Y > j \mid \mathbf{x}_2)]^{\exp (\mathbf{x}_2 - \mathbf{x}_1)^T \boldsymbol{\beta}}. \] It is called the proportional hazards model because the ratio of cumulative hazards does not depend on level $j$.

polr(sPID ~ nincome, method = "cloglog", data = nes96)

## Call:
## polr(formula = sPID ~ nincome, data = nes96, method = "cloglog")
## 
## Coefficients:
##     nincome 
## 0.008271135 
## 
## Intercepts:
##   Democrat|Independent Independent|Republican 
##             -0.2866703              0.4564409 
## 
## Residual Deviance: 2003.96 
## AIC: 2009.96

We see a relatively worse fit than proportional odds and ordered probit models.