To submit homework, please email the Rmd and html files to TA Shanpeng Li by the deadline.
Write down the log-likelihood function of logistic regression for binomial responses.
Derive the gradient vector and Hessian matrix of the log-likelihood function with respect to the regression coefficients \(\boldsymbol{\beta}\).
Show that the log-likelihood function of logistic regression is a concave function in regression coefficients \(\boldsymbol{\beta}\). (Hint: show that the negative Hessian is a positive semidefinite matrix.)
Of primary interest to the public is the risk of dying from COVID-19. A commonly used measure is the case fatality rate/ratio/risk (CFR), defined as \[ \frac{\text{number of deaths from disease}}{\text{number of diagnosed cases of disease}}. \] CFR is clearly not a fixed constant; it changes with time, location, and other factors. CFR also differs from the infection fatality rate (IFR), the probability that someone infected with COVID-19 dies from it.
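As a quick illustration of the definition, the sketch below computes the crude CFR for three counties whose counts appear in the 2020-04-04 data printed later in this document. The tibble name `toy` and the column name `cfr` are made up for this example.

```r
library(tibble)
library(dplyr)

# Counts copied from the 2020-04-04 data; `toy` is a hypothetical name
toy <- tribble(
  ~county,             ~confirmed, ~deaths,
  "Acadia, Louisiana",         65,       2,
  "Ada, Idaho",               360,       3,
  "Adams, Colorado",          294,       9
)

# crude CFR = number of deaths / number of diagnosed cases
toy <- toy %>%
  mutate(cfr = deaths / confirmed)

print(toy)
```

These ratios ignore the delay between diagnosis and death, so they are only a snapshot of the CFR on that date.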
In this exercise, we use logistic regression to study how US county-level CFR changes with demographic information and with health, education, and economic indicators.
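For orientation, here is a minimal sketch of how a logistic regression with binomial (grouped) responses can be fit in R. It uses simulated data only: the covariate `x`, the coefficients -3 and 0.5, and the data frame `sim` are assumptions for illustration and are not part of the homework data.

```r
set.seed(2020)

# Simulated counties; `x` is a made-up covariate, not a variable from the data set
n_county  <- 200
x         <- rnorm(n_county)
confirmed <- rpois(n_county, lambda = 50) + 5      # number of diagnosed cases
p_true    <- plogis(-3 + 0.5 * x)                  # assumed true death probability
deaths    <- rbinom(n_county, size = confirmed, prob = p_true)
sim       <- data.frame(x, confirmed, deaths)

# A binomial response is passed to glm() as a two-column matrix: (successes, failures)
fit <- glm(cbind(deaths, confirmed - deaths) ~ x,
           family = binomial, data = sim)

coef(fit)
```

With this response format each county contributes in proportion to its number of confirmed cases, so large counties carry more weight than small ones.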
04-04-2020.csv.gz
: The data on COVID-19 confirmed cases and deaths on 2020-04-04 was retrieved from the Johns Hopkins COVID-19 data repository. It was downloaded from this link (commit 0174f38).
us-county-health-rankings-2020.csv.gz
: The 2020 County Health Ranking Data was released by County Health Rankings. The data was downloaded from the Kaggle Uncover COVID-19 Challenge (version 1).
Load the tidyverse package for data manipulation and visualization.
# tidyverse for data manipulation and visualization
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Read in the data of COVID-19 cases reported on 2020-04-04.
county_count <- read_csv("./04-04-2020.csv.gz") %>%
  # cast fips into dbl for use as a key for joining tables
  mutate(FIPS = as.numeric(FIPS)) %>%
  filter(Country_Region == "US") %>%
  print(width = Inf)
## Parsed with column specification:
## cols(
## FIPS = col_character(),
## Admin2 = col_character(),
## Province_State = col_character(),
## Country_Region = col_character(),
## Last_Update = col_datetime(format = ""),
## Lat = col_double(),
## Long_ = col_double(),
## Confirmed = col_double(),
## Deaths = col_double(),
## Recovered = col_double(),
## Active = col_double(),
## Combined_Key = col_character()
## )
## # A tibble: 2,421 x 12
## FIPS Admin2 Province_State Country_Region Last_Update Lat
## <dbl> <chr> <chr> <chr> <dttm> <dbl>
## 1 45001 Abbeville South Carolina US 2020-04-04 23:34:21 34.2
## 2 22001 Acadia Louisiana US 2020-04-04 23:34:21 30.3
## 3 51001 Accomack Virginia US 2020-04-04 23:34:21 37.8
## 4 16001 Ada Idaho US 2020-04-04 23:34:21 43.5
## 5 19001 Adair Iowa US 2020-04-04 23:34:21 41.3
## 6 21001 Adair Kentucky US 2020-04-04 23:34:21 37.1
## 7 29001 Adair Missouri US 2020-04-04 23:34:21 40.2
## 8 40001 Adair Oklahoma US 2020-04-04 23:34:21 35.9
## 9 8001 Adams Colorado US 2020-04-04 23:34:21 39.9
## 10 16003 Adams Idaho US 2020-04-04 23:34:21 44.9
## Long_ Confirmed Deaths Recovered Active Combined_Key
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 -82.5 6 0 0 0 Abbeville, South Carolina, US
## 2 -92.4 65 2 0 0 Acadia, Louisiana, US
## 3 -75.6 8 0 0 0 Accomack, Virginia, US
## 4 -116. 360 3 0 0 Ada, Idaho, US
## 5 -94.5 1 0 0 0 Adair, Iowa, US
## 6 -85.3 3 0 0 0 Adair, Kentucky, US
## 7 -92.6 10 0 0 0 Adair, Missouri, US
## 8 -94.7 14 0 0 0 Adair, Oklahoma, US
## 9 -104. 294 9 0 0 Adams, Colorado, US
## 10 -116. 1 0 0 0 Adams, Idaho, US
## # … with 2,411 more rows
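The cast to numeric matters because `FIPS` is parsed as a character column here, while the join key must have the same type in both tables. The sketch below uses two made-up tables (`counts` and `info` are hypothetical names) and assumes the character codes may carry a leading zero; after the cast, "01001" and 1001 refer to the same county.

```r
library(dplyr)

# Hypothetical keys: one table stores FIPS as text, the other as numbers
counts <- tibble(fips = c("01001", "45001"), confirmed = c(12, 6))
info   <- tibble(fips = c(1001, 45001),
                 state = c("Alabama", "South Carolina"))

# Cast the text key so that both sides share the same type before joining
joined <- counts %>%
  mutate(fips = as.numeric(fips)) %>%   # "01001" becomes 1001
  left_join(info, by = "fips")

print(joined)
```

An alternative design would keep FIPS as zero-padded text in both tables; casting to numeric is simply the choice made in this document.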
Standardize the variable names by changing them to lower case.
names(county_count) <- str_to_lower(names(county_count))
Sanity check by displaying the unique US states and territories:
county_count %>%
  select(province_state) %>%
  distinct() %>%
  arrange(province_state) %>%
  print(n = Inf)
## # A tibble: 58 x 1
## province_state
## <chr>
## 1 Alabama
## 2 Alaska
## 3 Arizona
## 4 Arkansas
## 5 California
## 6 Colorado
## 7 Connecticut
## 8 Delaware
## 9 Diamond Princess
## 10 District of Columbia
## 11 Florida
## 12 Georgia
## 13 Grand Princess
## 14 Guam
## 15 Hawaii
## 16 Idaho
## 17 Illinois
## 18 Indiana
## 19 Iowa
## 20 Kansas
## 21 Kentucky
## 22 Louisiana
## 23 Maine
## 24 Maryland
## 25 Massachusetts
## 26 Michigan
## 27 Minnesota
## 28 Mississippi
## 29 Missouri
## 30 Montana
## 31 Nebraska
## 32 Nevada
## 33 New Hampshire
## 34 New Jersey
## 35 New Mexico
## 36 New York
## 37 North Carolina
## 38 North Dakota
## 39 Northern Mariana Islands
## 40 Ohio
## 41 Oklahoma
## 42 Oregon
## 43 Pennsylvania
## 44 Puerto Rico
## 45 Recovered
## 46 Rhode Island
## 47 South Carolina
## 48 South Dakota
## 49 Tennessee
## 50 Texas
## 51 Utah
## 52 Vermont
## 53 Virgin Islands
## 54 Virginia
## 55 Washington
## 56 West Virginia
## 57 Wisconsin
## 58 Wyoming
We exclude the entries for Diamond Princess, Grand Princess, Guam, Northern Mariana Islands, Puerto Rico, Recovered, and Virgin Islands, and keep only counties from the 50 states and DC.
county_count <- county_count %>%
  filter(!(province_state %in% c("Diamond Princess", "Grand Princess",
                                 "Recovered", "Guam", "Northern Mariana Islands",
                                 "Puerto Rico", "Virgin Islands"))) %>%
  print(width = Inf)
## # A tibble: 2,413 x 12
## fips admin2 province_state country_region last_update lat
## <dbl> <chr> <chr> <chr> <dttm> <dbl>
## 1 45001 Abbeville South Carolina US 2020-04-04 23:34:21 34.2
## 2 22001 Acadia Louisiana US 2020-04-04 23:34:21 30.3
## 3 51001 Accomack Virginia US 2020-04-04 23:34:21 37.8
## 4 16001 Ada Idaho US 2020-04-04 23:34:21 43.5
## 5 19001 Adair Iowa US 2020-04-04 23:34:21 41.3
## 6 21001 Adair Kentucky US 2020-04-04 23:34:21 37.1
## 7 29001 Adair Missouri US 2020-04-04 23:34:21 40.2
## 8 40001 Adair Oklahoma US 2020-04-04 23:34:21 35.9
## 9 8001 Adams Colorado US 2020-04-04 23:34:21 39.9
## 10 16003 Adams Idaho US 2020-04-04 23:34:21 44.9
## long_ confirmed deaths recovered active combined_key
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 -82.5 6 0 0 0 Abbeville, South Carolina, US
## 2 -92.4 65 2 0 0 Acadia, Louisiana, US
## 3 -75.6 8 0 0 0 Accomack, Virginia, US
## 4 -116. 360 3 0 0 Ada, Idaho, US
## 5 -94.5 1 0 0 0 Adair, Iowa, US
## 6 -85.3 3 0 0 0 Adair, Kentucky, US
## 7 -92.6 10 0 0 0 Adair, Missouri, US
## 8 -94.7 14 0 0 0 Adair, Oklahoma, US
## 9 -104. 294 9 0 0 Adams, Colorado, US
## 10 -116. 1 0 0 0 Adams, Idaho, US
## # … with 2,403 more rows
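The exclusion step can be followed by a quick check that none of the unwanted labels remain. The sketch below shows the pattern on a toy tibble (the rows and the names `toy`, `excluded`, and `kept` are made up). On the actual data, the 58 distinct values listed earlier minus the 7 excluded labels should leave 51, that is, the 50 states and DC.

```r
library(dplyr)

# Toy stand-in for county_count; the rows are made up for illustration
toy <- tibble(
  province_state = c("Alabama", "Guam", "Texas", "Recovered", "Alabama"),
  confirmed      = c(10, 3, 20, 0, 7)
)

excluded <- c("Guam", "Recovered")

# Keep only rows whose label is not in the exclusion list
kept <- toy %>%
  filter(!(province_state %in% excluded))

# Sanity check: none of the excluded labels survive the filter
stopifnot(!any(kept$province_state %in% excluded))

# Number of distinct labels that remain
n_distinct(kept$province_state)
```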
Graphically summarize the COVID-19 confirmed cases and deaths on 2020-04-04 by state.
county_count %>%
  # turn into long format for easy plotting
  pivot_longer(confirmed:recovered,
               names_to = "case",
               values_to = "count") %>%
  group_by(province_state) %>%
  ggplot() +
  geom_col(mapping = aes(x = province_state, y = `count`, fill = `case`)) +
  # scale_y_log10() +
  labs(title = "US COVID-19 Situation on 2020-04-04", x = "State") +
  theme(axis.text.x = element_text(angle = 90))
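To see what the long format looks like before it reaches `ggplot()`, the sketch below applies the same `pivot_longer()` call to a two-row toy table (the table `wide` and its state-level values are made up for illustration). Each original row becomes three rows, one per count type.

```r
library(dplyr)
library(tidyr)

# Toy wide table with one column per count type; values are illustrative only
wide <- tibble(
  province_state = c("Idaho", "Louisiana"),
  confirmed      = c(360, 65),
  deaths         = c(3, 2),
  recovered      = c(0, 0)
)

# One row per (state, count type) pair
long <- wide %>%
  pivot_longer(confirmed:recovered,
               names_to = "case",
               values_to = "count")

print(long)
```

The `case` column is what the plot maps to `fill`, which is why the reshaping is convenient here.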
Read in the 2020 county-level health ranking data.
county_info <- read_csv("./us-county-health-rankings-2020.csv.gz") %>%
  filter(!is.na(county)) %>%
  # cast fips into dbl for use as a key for joining tables
  mutate(fips = as.numeric(fips)) %>%
  select(fips,
         state,
         county,
         percent_fair_or_poor_health,
         percent_smokers,
         percent_adults_with_obesity,
         # food_environment_index,
         percent_with_access_to_exercise_opportunities,
         percent_excessive_drinking,
         # teen_birth_rate,
         percent_uninsured,
         # primary_care_physicians_rate,
         # preventable_hospitalization_rate,
         # high_school_graduation_rate,
         percent_some_college,
         percent_unemployed,
         percent_children_in_poverty,
         # `80th_percentile_income`,
         # `20th_percentile_income`,
         percent_single_parent_households,
         # violent_crime_rate,
         percent_severe_housing_problems,
         overcrowding,
         # life_expectancy,
         # age_adjusted_death_rate,
         percent_adults_with_diabetes,
         # hiv_prevalence_rate,
         percent_food_insecure,
         # percent_limited_access_to_healthy_foods,
         percent_insufficient_sleep,
         percent_uninsured_2,
         median_household_income,
         average_traffic_volume_per_meter_of_major_roadways,
         percent_homeowners,
         # percent_severe_housing_cost_burden,
         population_2,
         percent_less_than_18_years_of_age,
         percent_65_and_over,
         percent_black,
         percent_asian,
         percent_hispanic,
         percent_female,
         percent_rural) %>%
  print(width = Inf)
## Parsed with column specification:
## cols(
## .default = col_double(),
## state = col_character(),
## county = col_character(),
## unreliable = col_character(),
## primary_care_physicians_ratio = col_character(),
## dentist_ratio = col_character(),
## mental_health_provider_ratio = col_character(),
## presence_of_water_violation = col_logical(),
## other_primary_care_provider_ratio = col_character(),
## non_petitioned_cases = col_logical(),
## petitioned_cases = col_logical()
## )
## See spec(...) for full column specifications.
## # A tibble: 3,142 x 30
## fips state county percent_fair_or_poor_health percent_smokers
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 1001 Alabama Autauga 20.9 18.1
## 2 1003 Alabama Baldwin 17.5 17.5
## 3 1005 Alabama Barbour 29.6 22.0
## 4 1007 Alabama Bibb 19.4 19.1
## 5 1009 Alabama Blount 21.7 19.2
## 6 1011 Alabama Bullock 31.0 22.9
## 7 1013 Alabama Butler 27.9 21.8
## 8 1015 Alabama Calhoun 23.1 20.6
## 9 1017 Alabama Chambers 24.0 19.4
## 10 1019 Alabama Cherokee 20.7 17.5
## percent_adults_with_obesity percent_with_access_to_exercise_opportunities
## <dbl> <dbl>
## 1 33.3 69.1
## 2 31 73.7
## 3 41.7 53.2
## 4 37.6 16.3
## 5 33.8 15.6
## 6 37.2 2.50
## 7 43.3 48.6
## 8 38.5 47.7
## 9 40.1 61.9
## 10 35 33.4
## percent_excessive_drinking percent_uninsured percent_some_college
## <dbl> <dbl> <dbl>
## 1 15.0 8.72 62.0
## 2 18.0 11.3 67.4
## 3 12.8 12.2 34.9
## 4 15.6 10.2 44.1
## 5 14.2 13.4 53.4
## 6 12.1 11.4 35.0
## 7 11.9 11.2 41.7
## 8 13.8 11.9 59.2
## 9 12.7 11.9 48.5
## 10 14.1 11.2 51.8
## percent_unemployed percent_children_in_poverty
## <dbl> <dbl>
## 1 3.63 19.3
## 2 3.62 13.9
## 3 5.17 43.9
## 4 3.97 27.8
## 5 3.51 18
## 6 4.69 68.3
## 7 4.79 36.3
## 8 4.65 26.5
## 9 3.91 30.7
## 10 3.57 24.7
## percent_single_parent_households percent_severe_housing_problems overcrowding
## <dbl> <dbl> <dbl>
## 1 26.2 14.7 1.20
## 2 24.1 13.6 1.27
## 3 56.6 14.6 1.69
## 4 28.7 10.5 0.255
## 5 28.6 10.5 1.89
## 6 74.8 18.1 0.113
## 7 52.7 13.2 1.69
## 8 40.2 13.7 1.54
## 9 46.6 16.0 4.04
## 10 23.8 13 1.5
## percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
## <dbl> <dbl> <dbl>
## 1 11.1 13.2 35.9
## 2 10.7 11.6 33.3
## 3 17.6 22 38.6
## 4 14.5 14.3 38.1
## 5 17 10.7 35.9
## 6 23.7 24.8 45.0
## 7 19.2 20.6 41.9
## 8 17.5 15.7 41.3
## 9 19.9 17.9 37.3
## 10 15.2 12.5 35.4
## percent_uninsured_2 median_household_income
## <dbl> <dbl>
## 1 11.1 59338
## 2 14.3 57588
## 3 16.1 34382
## 4 13 46064
## 5 17.1 50412
## 6 15.2 29267
## 7 14.5 37365
## 8 15.4 45400
## 9 15.2 39917
## 10 13.9 42132
## average_traffic_volume_per_meter_of_major_roadways percent_homeowners
## <dbl> <dbl>
## 1 88.5 74.9
## 2 87.0 73.6
## 3 102. 61.4
## 4 29.3 75.1
## 5 33.4 78.6
## 6 4.07 75.5
## 7 19.3 69.9
## 8 110. 69.5
## 9 20.3 67.8
## 10 25.9 79.0
## population_2 percent_less_than_18_years_of_age percent_65_and_over
## <dbl> <dbl> <dbl>
## 1 55601 23.7 15.6
## 2 218022 21.6 20.4
## 3 24881 20.9 19.4
## 4 22400 20.5 16.5
## 5 57840 23.2 18.2
## 6 10138 21.1 16.4
## 7 19680 22.2 20.3
## 8 114277 21.6 17.7
## 9 33615 20.8 19.5
## 10 26032 19.2 23.0
## percent_black percent_asian percent_hispanic percent_female percent_rural
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19.3 1.22 2.97 51.4 42.0
## 2 8.78 1.15 4.65 51.5 42.3
## 3 48.0 0.454 4.28 47.2 67.8
## 4 21.1 0.237 2.62 46.8 68.4
## 5 1.46 0.320 9.57 50.7 90.0
## 6 69.5 0.187 7.96 45.5 51.4
## 7 44.6 1.32 1.51 53.4 71.2
## 8 20.9 0.964 3.91 51.9 33.7
## 9 39.6 1.33 2.56 52.1 49.1
## 10 4.24 0.338 1.62 50.5 85.7
## # … with 3,132 more rows
For stability in estimating CFR, we restrict to counties with \(\ge 5\) confirmed cases.
county_count <- county_count %>%
  filter(confirmed >= 5)
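One way to see why very small counts make the CFR estimate unstable: under a binomial model, the standard error of deaths/confirmed is \(\sqrt{p(1-p)/n}\), which shrinks only as the number of confirmed cases \(n\) grows. The sketch below tabulates this for an assumed true CFR of 2%; that value and the case counts are made up for illustration only.

```r
# Assumed true CFR of 2% (a made-up value for illustration only)
p <- 0.02
n <- c(1, 5, 20, 100, 1000)          # confirmed cases in a county

# Standard error of deaths / confirmed under Binomial(n, p)
se <- sqrt(p * (1 - p) / n)

data.frame(confirmed = n, se = round(se, 4))
```

With a single confirmed case the standard error is 0.14, seven times the assumed CFR itself, so such counties carry almost no information about the rate.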
We join the COVID-19 count data and the county-level information using the FIPS (Federal Information Processing Standards) code as the key.
county_data <- county_count %>%
  left_join(county_info, by = "fips") %>%
  print(width = Inf)
## # A tibble: 1,466 x 41
## fips admin2 province_state country_region last_update lat
## <dbl> <chr> <chr> <chr> <dttm> <dbl>
## 1 45001 Abbeville South Carolina US 2020-04-04 23:34:21 34.2
## 2 22001 Acadia Louisiana US 2020-04-04 23:34:21 30.3
## 3 51001 Accomack Virginia US 2020-04-04 23:34:21 37.8
## 4 16001 Ada Idaho US 2020-04-04 23:34:21 43.5
## 5 29001 Adair Missouri US 2020-04-04 23:34:21 40.2
## 6 40001 Adair Oklahoma US 2020-04-04 23:34:21 35.9
## 7 8001 Adams Colorado US 2020-04-04 23:34:21 39.9
## 8 28001 Adams Mississippi US 2020-04-04 23:34:21 31.5
## 9 31001 Adams Nebraska US 2020-04-04 23:34:21 40.5
## 10 42001 Adams Pennsylvania US 2020-04-04 23:34:21 39.9
## long_ confirmed deaths recovered active combined_key
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 -82.5 6 0 0 0 Abbeville, South Carolina, US
## 2 -92.4 65 2 0 0 Acadia, Louisiana, US
## 3 -75.6 8 0 0 0 Accomack, Virginia, US
## 4 -116. 360 3 0 0 Ada, Idaho, US
## 5 -92.6 10 0 0 0 Adair, Missouri, US
## 6 -94.7 14 0 0 0 Adair, Oklahoma, US
## 7 -104. 294 9 0 0 Adams, Colorado, US
## 8 -91.4 16 0 0 0 Adams, Mississippi, US
## 9 -98.5 8 0 0 0 Adams, Nebraska, US
## 10 -77.2 21 0 0 0 Adams, Pennsylvania, US
## state county percent_fair_or_poor_health percent_smokers
## <chr> <chr> <dbl> <dbl>
## 1 South Carolina Abbeville 19.9 17.3
## 2 Louisiana Acadia 20.9 21.5
## 3 Virginia Accomack 20.1 18.3
## 4 Idaho Ada 11.5 12.0
## 5 Missouri Adair 21.4 20.5
## 6 Oklahoma Adair 28.5 27.7
## 7 Colorado Adams 16.6 16.3
## 8 Mississippi Adams 27.3 22.2
## 9 Nebraska Adams 15.8 14.6
## 10 Pennsylvania Adams 15.3 16.2
## percent_adults_with_obesity percent_with_access_to_exercise_opportunities
## <dbl> <dbl>
## 1 36.7 59.0
## 2 38.4 42.5
## 3 36.3 37.4
## 4 25.6 89.5
## 5 27.9 78.3
## 6 47.7 28.5
## 7 27.8 93.1
## 8 35.3 69.1
## 9 36.7 81.6
## 10 35.6 60.6
## percent_excessive_drinking percent_uninsured percent_some_college
## <dbl> <dbl> <dbl>
## 1 15.9 12.9 52.5
## 2 19.8 10.7 43.6
## 3 15.5 16.6 45.1
## 4 17.9 8.74 73.8
## 5 18.9 10.6 65.3
## 6 11.8 24.5 35.1
## 7 18.9 11.0 57.0
## 8 12.3 15.0 41.7
## 9 18.5 8.76 70.8
## 10 19.2 7.49 57.3
## percent_unemployed percent_children_in_poverty
## <dbl> <dbl>
## 1 3.98 30.8
## 2 5.37 35.4
## 3 3.81 27
## 4 2.46 10.2
## 5 3.51 19.9
## 6 4.17 34.9
## 7 3.47 12.6
## 8 6.21 40.4
## 9 2.87 14.4
## 10 3.27 11.2
## percent_single_parent_households percent_severe_housing_problems overcrowding
## <dbl> <dbl> <dbl>
## 1 37.1 14.3 0.463
## 2 33.4 12.3 3.51
## 3 45.9 15.1 2.10
## 4 23.8 14.0 1.46
## 5 29.5 18.0 0.740
## 6 38.3 15.4 5.65
## 7 31.0 18.1 5.37
## 8 66.4 12.8 2.37
## 9 26.2 10.5 0.904
## 10 26.7 12.3 1.88
## percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
## <dbl> <dbl> <dbl>
## 1 15.8 15.2 36.1
## 2 11.4 15.1 32.4
## 3 15.9 14.1 36.8
## 4 7.9 12 26.3
## 5 8.4 17.5 31.9
## 6 24.3 19.1 39.5
## 7 7.7 8 31.0
## 8 13.2 24.7 41.1
## 9 11 11.7 30.1
## 10 8.5 8.3 34.7
## percent_uninsured_2 median_household_income
## <dbl> <dbl>
## 1 15.9 42412
## 2 14.0 40484
## 3 19.4 42879
## 4 11.1 66827
## 5 12.3 40395
## 6 29.6 35156
## 7 13.8 70199
## 8 18.7 33392
## 9 10.7 55167
## 10 8.46 62877
## average_traffic_volume_per_meter_of_major_roadways percent_homeowners
## <dbl> <dbl>
## 1 11.6 76.3
## 2 63.7 70.8
## 3 60.0 67.9
## 4 277. 68.4
## 5 45.8 60.0
## 6 16.7 68.6
## 7 490. 65.2
## 8 150. 61.7
## 9 53.4 68.2
## 10 113. 77.2
## population_2 percent_less_than_18_years_of_age percent_65_and_over
## <dbl> <dbl> <dbl>
## 1 24541 20.1 21.8
## 2 62190 25.8 15.3
## 3 32412 20.5 23.6
## 4 469966 23.8 14.4
## 5 25339 18.4 14.8
## 6 22082 26.6 15.9
## 7 511868 26.5 10.5
## 8 31192 20.1 18.8
## 9 31511 23.7 18.2
## 10 102811 20.0 20.4
## percent_black percent_asian percent_hispanic percent_female percent_rural
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 27.5 0.412 1.54 51.6 78.6
## 2 17.9 0.320 2.73 51.2 51.7
## 3 28.0 0.781 9.34 51.2 100
## 4 1.24 2.81 8.31 49.9 5.47
## 5 2.85 2.28 2.57 51.9 37.9
## 6 0.534 0.802 6.82 50.1 83.3
## 7 3.19 4.37 40.4 49.5 3.62
## 8 52.4 0.513 11.3 47.9 37.2
## 9 0.996 1.33 10.9 50.2 22.5
## 10 1.60 0.875 7.11 50.8 53.7
## # … with 1,456 more rows
Numerical summaries of each variable:
summary(county_data)
## fips admin2 province_state country_region
## Min. : 1001 Length:1466 Length:1466 Length:1466
## 1st Qu.:18003 Class :character Class :character Class :character
## Median :29029 Mode :character Mode :character Mode :character
## Mean :30076
## 3rd Qu.:42077
## Max. :90053
## NA's :13
## last_update lat long_
## Min. :2020-04-04 23:34:21 Min. :19.60 Min. :-159.60
## 1st Qu.:2020-04-04 23:34:21 1st Qu.:33.96 1st Qu.: -94.56
## Median :2020-04-04 23:34:21 Median :38.02 Median : -86.48
## Mean :2020-04-04 23:34:21 Mean :37.71 Mean : -89.73
## 3rd Qu.:2020-04-04 23:34:21 3rd Qu.:41.38 3rd Qu.: -81.22
## Max. :2020-04-04 23:34:21 Max. :64.81 Max. : -68.65
## NA's :19 NA's :19
## confirmed deaths recovered active
## Min. : 5.0 Min. : 0.000 Min. :0 Min. :0
## 1st Qu.: 9.0 1st Qu.: 0.000 1st Qu.:0 1st Qu.:0
## Median : 20.0 Median : 0.000 Median :0 Median :0
## Mean : 208.8 Mean : 4.842 Mean :0 Mean :0
## 3rd Qu.: 68.0 3rd Qu.: 2.000 3rd Qu.:0 3rd Qu.:0
## Max. :63306.0 Max. :1905.000 Max. :0 Max. :0
##
## combined_key state county
## Length:1466 Length:1466 Length:1466
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## percent_fair_or_poor_health percent_smokers percent_adults_with_obesity
## Min. : 8.121 Min. : 5.909 Min. :12.40
## 1st Qu.:14.390 1st Qu.:14.899 1st Qu.:29.10
## Median :17.010 Median :17.147 Median :32.95
## Mean :17.594 Mean :17.153 Mean :32.41
## 3rd Qu.:20.377 3rd Qu.:19.365 3rd Qu.:36.20
## Max. :38.887 Max. :27.775 Max. :51.00
## NA's :28 NA's :28 NA's :28
## percent_with_access_to_exercise_opportunities percent_excessive_drinking
## Min. : 0.00 Min. : 7.81
## 1st Qu.: 59.95 1st Qu.:15.70
## Median : 74.71 Median :18.03
## Mean : 71.14 Mean :17.92
## 3rd Qu.: 85.94 3rd Qu.:20.00
## Max. :100.00 Max. :28.62
## NA's :28 NA's :28
## percent_uninsured percent_some_college percent_unemployed
## Min. : 2.263 Min. :30.06 Min. : 1.582
## 1st Qu.: 6.754 1st Qu.:53.24 1st Qu.: 3.252
## Median : 9.925 Median :61.21 Median : 3.870
## Mean :10.583 Mean :60.87 Mean : 4.071
## 3rd Qu.:13.519 3rd Qu.:68.74 3rd Qu.: 4.690
## Max. :31.208 Max. :90.34 Max. :18.092
## NA's :28 NA's :28 NA's :28
## percent_children_in_poverty percent_single_parent_households
## Min. : 2.50 Min. : 9.43
## 1st Qu.:12.82 1st Qu.:27.09
## Median :18.40 Median :32.96
## Mean :19.46 Mean :33.84
## 3rd Qu.:24.50 3rd Qu.:38.94
## Max. :55.00 Max. :80.00
## NA's :28 NA's :28
## percent_severe_housing_problems overcrowding percent_adults_with_diabetes
## Min. : 6.562 Min. : 0.000 Min. : 1.800
## 1st Qu.:12.267 1st Qu.: 1.378 1st Qu.: 9.125
## Median :14.439 Median : 1.962 Median :11.300
## Mean :15.079 Mean : 2.429 Mean :11.749
## 3rd Qu.:16.976 3rd Qu.: 2.882 3rd Qu.:13.900
## Max. :33.391 Max. :14.489 Max. :34.100
## NA's :28 NA's :28 NA's :28
## percent_food_insecure percent_insufficient_sleep percent_uninsured_2
## Min. : 3.40 Min. :23.03 Min. : 2.683
## 1st Qu.:10.70 1st Qu.:31.42 1st Qu.: 7.865
## Median :12.70 Median :34.02 Median :12.027
## Mean :13.26 Mean :33.88 Mean :12.776
## 3rd Qu.:15.20 3rd Qu.:36.56 3rd Qu.:16.541
## Max. :33.50 Max. :46.71 Max. :42.397
## NA's :28 NA's :28 NA's :28
## median_household_income average_traffic_volume_per_meter_of_major_roadways
## Min. : 25385 Min. : 0.00
## 1st Qu.: 46994 1st Qu.: 53.05
## Median : 54317 Median : 105.00
## Mean : 57584 Mean : 201.39
## 3rd Qu.: 64754 3rd Qu.: 206.92
## Max. :140382 Max. :4444.12
## NA's :28 NA's :28
## percent_homeowners population_2 percent_less_than_18_years_of_age
## Min. :24.13 Min. : 2887 Min. : 7.069
## 1st Qu.:64.34 1st Qu.: 36502 1st Qu.:20.321
## Median :69.98 Median : 75478 Median :22.182
## Mean :68.98 Mean : 202450 Mean :22.197
## 3rd Qu.:74.78 3rd Qu.: 180031 3rd Qu.:24.002
## Max. :89.76 Max. :10105518 Max. :35.447
## NA's :28 NA's :28 NA's :28
## percent_65_and_over percent_black percent_asian percent_hispanic
## Min. : 7.722 Min. : 0.1286 Min. : 0.06245 Min. : 0.7952
## 1st Qu.:14.927 1st Qu.: 1.6175 1st Qu.: 0.68248 1st Qu.: 2.9419
## Median :17.222 Median : 5.6397 Median : 1.23421 Median : 5.5939
## Mean :17.516 Mean :12.4178 Mean : 2.40412 Mean :10.0010
## 3rd Qu.:19.598 3rd Qu.:17.5931 3rd Qu.: 2.67550 3rd Qu.:11.0564
## Max. :57.587 Max. :81.9544 Max. :42.95231 Max. :96.3595
## NA's :28 NA's :28 NA's :28 NA's :28
## percent_female percent_rural
## Min. :34.63 Min. : 0.00
## 1st Qu.:50.00 1st Qu.: 17.11
## Median :50.66 Median : 36.97
## Mean :50.46 Mean : 40.11
## 3rd Qu.:51.35 3rd Qu.: 60.00
## Max. :56.87 Max. :100.00
## NA's :28 NA's :28
List the rows in county_data that come from county_count but have no match in county_info:
county_data %>%
  filter(is.na(state) & is.na(county)) %>%
  print(n = Inf)
## # A tibble: 28 x 41
## fips admin2 province_state country_region last_update lat long_
## <dbl> <chr> <chr> <chr> <dttm> <dbl> <dbl>
## 1 NA DeKalb Tennessee US 2020-04-04 23:34:21 36.0 -85.8
## 2 NA DeSoto Florida US 2020-04-04 23:34:21 27.2 -81.8
## 3 NA Dukes… Massachusetts US 2020-04-04 23:34:21 41.4 -70.7
## 4 NA Fillm… Minnesota US 2020-04-04 23:34:21 43.7 -92.1
## 5 NA Kansa… Missouri US 2020-04-04 23:34:21 39.1 -94.6
## 6 NA LaSal… Illinois US 2020-04-04 23:34:21 41.3 -88.9
## 7 NA Manas… Virginia US 2020-04-04 23:34:21 38.7 -77.5
## 8 NA McDuf… Georgia US 2020-04-04 23:34:21 33.5 -82.5
## 9 NA Out o… Michigan US 2020-04-04 23:34:21 NA NA
## 10 NA Out o… Tennessee US 2020-04-04 23:34:21 NA NA
## 11 90005 Unass… Arkansas US 2020-04-04 23:34:21 NA NA
## 12 90008 Unass… Colorado US 2020-04-04 23:34:21 NA NA
## 13 90009 Unass… Connecticut US 2020-04-04 23:34:21 NA NA
## 14 90013 Unass… Georgia US 2020-04-04 23:34:21 NA NA
## 15 90015 Unass… Hawaii US 2020-04-04 23:34:21 NA NA
## 16 90017 Unass… Illinois US 2020-04-04 23:34:21 NA NA
## 17 90021 Unass… Kentucky US 2020-04-04 23:34:21 NA NA
## 18 NA Unass… Louisiana US 2020-04-04 23:34:21 NA NA
## 19 90023 Unass… Maine US 2020-04-04 23:34:21 NA NA
## 20 90025 Unass… Massachusetts US 2020-04-04 23:34:21 NA NA
## 21 NA Unass… Michigan US 2020-04-04 23:34:21 NA NA
## 22 90032 Unass… Nevada US 2020-04-04 23:34:21 NA NA
## 23 90034 Unass… New Jersey US 2020-04-04 23:34:21 NA NA
## 24 90044 Unass… Rhode Island US 2020-04-04 23:34:21 NA NA
## 25 90047 Unass… Tennessee US 2020-04-04 23:34:21 NA NA
## 26 90050 Unass… Vermont US 2020-04-04 23:34:21 NA NA
## 27 90053 Unass… Washington US 2020-04-04 23:34:21 NA NA
## 28 NA Weber Utah US 2020-04-04 23:34:21 41.3 -112.
## # … with 34 more variables: confirmed <dbl>, deaths <dbl>, recovered <dbl>,
## # active <dbl>, combined_key <chr>, state <chr>, county <chr>,
## # percent_fair_or_poor_health <dbl>, percent_smokers <dbl>,
## # percent_adults_with_obesity <dbl>,
## # percent_with_access_to_exercise_opportunities <dbl>,
## # percent_excessive_drinking <dbl>, percent_uninsured <dbl>,
## # percent_some_college <dbl>, percent_unemployed <dbl>,
## # percent_children_in_poverty <dbl>, percent_single_parent_households <dbl>,
## # percent_severe_housing_problems <dbl>, overcrowding <dbl>,
## # percent_adults_with_diabetes <dbl>, percent_food_insecure <dbl>,
## # percent_insufficient_sleep <dbl>, percent_uninsured_2 <dbl>,
## # median_household_income <dbl>,
## # average_traffic_volume_per_meter_of_major_roadways <dbl>,
## # percent_homeowners <dbl>, population_2 <dbl>,
## # percent_less_than_18_years_of_age <dbl>, percent_65_and_over <dbl>,
## # percent_black <dbl>, percent_asian <dbl>, percent_hispanic <dbl>,
## # percent_female <dbl>, percent_rural <dbl>
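An alternative way to list unmatched rows, without first building the joined table, is dplyr's `anti_join()`, which returns the rows of its first argument that have no partner in the second. The sketch below uses two small made-up tables (`counts` and `info` are hypothetical names) that mimic the two failure modes seen above: a missing FIPS and a FIPS code that does not correspond to a county.

```r
library(dplyr)

# Made-up stand-ins for county_count and county_info
counts <- tibble(
  fips   = c(45001, NA, 90005),
  admin2 = c("Abbeville", "DeKalb", "Unassigned")
)
info <- tibble(
  fips   = c(45001, 47041),
  county = c("Abbeville", "DeKalb")
)

# Rows of `counts` whose key has no partner in `info`
unmatched <- anti_join(counts, info, by = "fips")

print(unmatched)
```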
Some rows are missing fips.
county_count %>%
  filter(is.na(fips)) %>%
  select(fips, admin2, province_state) %>%
  print(n = Inf)
## # A tibble: 13 x 3
## fips admin2 province_state
## <dbl> <chr> <chr>
## 1 NA DeKalb Tennessee
## 2 NA DeSoto Florida
## 3 NA Dukes and Nantucket Massachusetts
## 4 NA Fillmore Minnesota
## 5 NA Kansas City Missouri
## 6 NA LaSalle Illinois
## 7 NA Manassas Virginia
## 8 NA McDuffie Georgia
## 9 NA Out of MI Michigan
## 10 NA Out of TN Tennessee
## 11 NA Unassigned Louisiana
## 12 NA Unassigned Michigan
## 13 NA Weber Utah
We need to (1) manually set the fips for some counties, (2) discard the entries labeled Unassigned, unassigned, or Out of, and (3) join with county_info again.
county_data <- county_count %>%
  # manually set FIPS for some counties
  mutate(fips = ifelse(admin2 == "DeKalb" & province_state == "Tennessee", 47041, fips)) %>%
  mutate(fips = ifelse(admin2 == "DeSoto" & province_state == "Florida", 12027, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Dona Ana" & province_state == "New Mexico", 35013, fips)) %>%
  mutate(fips = ifelse(admin2 == "Dukes and Nantucket" & province_state == "Massachusetts", 25019, fips)) %>%
  mutate(fips = ifelse(admin2 == "Fillmore" & province_state == "Minnesota", 27045, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Harris" & province_state == "Texas", 48201, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Kenai Peninsula" & province_state == "Alaska", 2122, fips)) %>%
  mutate(fips = ifelse(admin2 == "LaSalle" & province_state == "Illinois", 17099, fips)) %>%
  #mutate(fips = ifelse(admin2 == "LaSalle" & province_state == "Louisiana", 22059, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Lac qui Parle" & province_state == "Minnesota", 27073, fips)) %>%
  mutate(fips = ifelse(admin2 == "Manassas" & province_state == "Virginia", 51683, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Matanuska-Susitna" & province_state == "Alaska", 2170, fips)) %>%
  mutate(fips = ifelse(admin2 == "McDuffie" & province_state == "Georgia", 13189, fips)) %>%
  #mutate(fips = ifelse(admin2 == "McIntosh" & province_state == "Georgia", 13191, fips)) %>%
  #mutate(fips = ifelse(admin2 == "McKean" & province_state == "Pennsylvania", 42083, fips)) %>%
  mutate(fips = ifelse(admin2 == "Weber" & province_state == "Utah", 49057, fips)) %>%
  filter(!(is.na(fips) | str_detect(admin2, "Out of") | str_detect(admin2, "Unassigned"))) %>%
  left_join(county_info, by = "fips") %>%
  print(width = Inf)
## # A tibble: 1,446 x 41
## fips admin2 province_state country_region last_update lat
## <dbl> <chr> <chr> <chr> <dttm> <dbl>
## 1 45001 Abbeville South Carolina US 2020-04-04 23:34:21 34.2
## 2 22001 Acadia Louisiana US 2020-04-04 23:34:21 30.3
## 3 51001 Accomack Virginia US 2020-04-04 23:34:21 37.8
## 4 16001 Ada Idaho US 2020-04-04 23:34:21 43.5
## 5 29001 Adair Missouri US 2020-04-04 23:34:21 40.2
## 6 40001 Adair Oklahoma US 2020-04-04 23:34:21 35.9
## 7 8001 Adams Colorado US 2020-04-04 23:34:21 39.9
## 8 28001 Adams Mississippi US 2020-04-04 23:34:21 31.5
## 9 31001 Adams Nebraska US 2020-04-04 23:34:21 40.5
## 10 42001 Adams Pennsylvania US 2020-04-04 23:34:21 39.9
## long_ confirmed deaths recovered active combined_key
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 -82.5 6 0 0 0 Abbeville, South Carolina, US
## 2 -92.4 65 2 0 0 Acadia, Louisiana, US
## 3 -75.6 8 0 0 0 Accomack, Virginia, US
## 4 -116. 360 3 0 0 Ada, Idaho, US
## 5 -92.6 10 0 0 0 Adair, Missouri, US
## 6 -94.7 14 0 0 0 Adair, Oklahoma, US
## 7 -104. 294 9 0 0 Adams, Colorado, US
## 8 -91.4 16 0 0 0 Adams, Mississippi, US
## 9 -98.5 8 0 0 0 Adams, Nebraska, US
## 10 -77.2 21 0 0 0 Adams, Pennsylvania, US
## state county percent_fair_or_poor_health percent_smokers
## <chr> <chr> <dbl> <dbl>
## 1 South Carolina Abbeville 19.9 17.3
## 2 Louisiana Acadia 20.9 21.5
## 3 Virginia Accomack 20.1 18.3
## 4 Idaho Ada 11.5 12.0
## 5 Missouri Adair 21.4 20.5
## 6 Oklahoma Adair 28.5 27.7
## 7 Colorado Adams 16.6 16.3
## 8 Mississippi Adams 27.3 22.2
## 9 Nebraska Adams 15.8 14.6
## 10 Pennsylvania Adams 15.3 16.2
## percent_adults_with_obesity percent_with_access_to_exercise_opportunities
## <dbl> <dbl>
## 1 36.7 59.0
## 2 38.4 42.5
## 3 36.3 37.4
## 4 25.6 89.5
## 5 27.9 78.3
## 6 47.7 28.5
## 7 27.8 93.1
## 8 35.3 69.1
## 9 36.7 81.6
## 10 35.6 60.6
## percent_excessive_drinking percent_uninsured percent_some_college
## <dbl> <dbl> <dbl>
## 1 15.9 12.9 52.5
## 2 19.8 10.7 43.6
## 3 15.5 16.6 45.1
## 4 17.9 8.74 73.8
## 5 18.9 10.6 65.3
## 6 11.8 24.5 35.1
## 7 18.9 11.0 57.0
## 8 12.3 15.0 41.7
## 9 18.5 8.76 70.8
## 10 19.2 7.49 57.3
## percent_unemployed percent_children_in_poverty
## <dbl> <dbl>
## 1 3.98 30.8
## 2 5.37 35.4
## 3 3.81 27
## 4 2.46 10.2
## 5 3.51 19.9
## 6 4.17 34.9
## 7 3.47 12.6
## 8 6.21 40.4
## 9 2.87 14.4
## 10 3.27 11.2
## percent_single_parent_households percent_severe_housing_problems overcrowding
## <dbl> <dbl> <dbl>
## 1 37.1 14.3 0.463
## 2 33.4 12.3 3.51
## 3 45.9 15.1 2.10
## 4 23.8 14.0 1.46
## 5 29.5 18.0 0.740
## 6 38.3 15.4 5.65
## 7 31.0 18.1 5.37
## 8 66.4 12.8 2.37
## 9 26.2 10.5 0.904
## 10 26.7 12.3 1.88
## percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
## <dbl> <dbl> <dbl>
## 1 15.8 15.2 36.1
## 2 11.4 15.1 32.4
## 3 15.9 14.1 36.8
## 4 7.9 12 26.3
## 5 8.4 17.5 31.9
## 6 24.3 19.1 39.5
## 7 7.7 8 31.0
## 8 13.2 24.7 41.1
## 9 11 11.7 30.1
## 10 8.5 8.3 34.7
## percent_uninsured_2 median_household_income
## <dbl> <dbl>
## 1 15.9 42412
## 2 14.0 40484
## 3 19.4 42879
## 4 11.1 66827
## 5 12.3 40395
## 6 29.6 35156
## 7 13.8 70199
## 8 18.7 33392
## 9 10.7 55167
## 10 8.46 62877
## average_traffic_volume_per_meter_of_major_roadways percent_homeowners
## <dbl> <dbl>
## 1 11.6 76.3
## 2 63.7 70.8
## 3 60.0 67.9
## 4 277. 68.4
## 5 45.8 60.0
## 6 16.7 68.6
## 7 490. 65.2
## 8 150. 61.7
## 9 53.4 68.2
## 10 113. 77.2
## population_2 percent_less_than_18_years_of_age percent_65_and_over
## <dbl> <dbl> <dbl>
## 1 24541 20.1 21.8
## 2 62190 25.8 15.3
## 3 32412 20.5 23.6
## 4 469966 23.8 14.4
## 5 25339 18.4 14.8
## 6 22082 26.6 15.9
## 7 511868 26.5 10.5
## 8 31192 20.1 18.8
## 9 31511 23.7 18.2
## 10 102811 20.0 20.4
## percent_black percent_asian percent_hispanic percent_female percent_rural
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 27.5 0.412 1.54 51.6 78.6
## 2 17.9 0.320 2.73 51.2 51.7
## 3 28.0 0.781 9.34 51.2 100
## 4 1.24 2.81 8.31 49.9 5.47
## 5 2.85 2.28 2.57 51.9 37.9
## 6 0.534 0.802 6.82 50.1 83.3
## 7 3.19 4.37 40.4 49.5 3.62
## 8 52.4 0.513 11.3 47.9 37.2
## 9 0.996 1.33 10.9 50.2 22.5
## 10 1.60 0.875 7.11 50.8 53.7
## # … with 1,436 more rows
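The repeated `mutate(ifelse(...))` calls can also be written as a small lookup table that is joined in and then merged with `coalesce()`. This is only a sketch of that alternative, not the approach used above: the table `fips_patch` and the column `fips_fix` are hypothetical names, and only two of the corrections are shown (with the FIPS codes taken from the code above).

```r
library(tibble)
library(dplyr)

# Made-up stand-in for county_count with two missing FIPS codes
counts <- tibble(
  fips           = c(NA, NA, 45001),
  admin2         = c("DeKalb", "Weber", "Abbeville"),
  province_state = c("Tennessee", "Utah", "South Carolina")
)

# Hypothetical lookup table of manual corrections
fips_patch <- tribble(
  ~admin2,  ~province_state, ~fips_fix,
  "DeKalb", "Tennessee",     47041,
  "Weber",  "Utah",          49057
)

patched <- counts %>%
  left_join(fips_patch, by = c("admin2", "province_state")) %>%
  # keep the original fips when present, otherwise use the manual fix
  mutate(fips = coalesce(fips, fips_fix)) %>%
  select(-fips_fix)

print(patched)
```

Keeping the corrections in one table makes them easier to review and extend than a long chain of conditional assignments.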
Summarize the joined data again:
summary(county_data)
## fips admin2 province_state country_region
## Min. : 1001 Length:1446 Length:1446 Length:1446
## 1st Qu.:17186 Class :character Class :character Class :character
## Median :28156 Mode :character Mode :character Mode :character
## Mean :29455
## 3rd Qu.:42048
## Max. :56039
## last_update lat long_
## Min. :2020-04-04 23:34:21 Min. :19.60 Min. :-159.60
## 1st Qu.:2020-04-04 23:34:21 1st Qu.:33.96 1st Qu.: -94.52
## Median :2020-04-04 23:34:21 Median :38.02 Median : -86.48
## Mean :2020-04-04 23:34:21 Mean :37.71 Mean : -89.73
## 3rd Qu.:2020-04-04 23:34:21 3rd Qu.:41.39 3rd Qu.: -81.21
## Max. :2020-04-04 23:34:21 Max. :64.81 Max. : -68.65
## confirmed deaths recovered active
## Min. : 5.0 Min. : 0.000 Min. :0 Min. :0
## 1st Qu.: 9.0 1st Qu.: 0.000 1st Qu.:0 1st Qu.:0
## Median : 20.0 Median : 0.000 Median :0 Median :0
## Mean : 207.2 Mean : 4.854 Mean :0 Mean :0
## 3rd Qu.: 66.0 3rd Qu.: 2.000 3rd Qu.:0 3rd Qu.:0
## Max. :63306.0 Max. :1905.000 Max. :0 Max. :0
## combined_key state county
## Length:1446 Length:1446 Length:1446
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## percent_fair_or_poor_health percent_smokers percent_adults_with_obesity
## Min. : 8.121 Min. : 5.909 Min. :12.40
## 1st Qu.:14.390 1st Qu.:14.899 1st Qu.:29.10
## Median :17.010 Median :17.143 Median :32.90
## Mean :17.594 Mean :17.151 Mean :32.39
## 3rd Qu.:20.398 3rd Qu.:19.365 3rd Qu.:36.20
## Max. :38.887 Max. :27.775 Max. :51.00
## percent_with_access_to_exercise_opportunities percent_excessive_drinking
## Min. : 0.00 Min. : 7.81
## 1st Qu.: 59.95 1st Qu.:15.68
## Median : 74.71 Median :18.03
## Mean : 71.15 Mean :17.91
## 3rd Qu.: 85.97 3rd Qu.:20.01
## Max. :100.00 Max. :28.62
## percent_uninsured percent_some_college percent_unemployed
## Min. : 2.263 Min. :21.14 Min. : 1.582
## 1st Qu.: 6.754 1st Qu.:53.21 1st Qu.: 3.252
## Median : 9.937 Median :61.19 Median : 3.870
## Mean :10.592 Mean :60.83 Mean : 4.071
## 3rd Qu.:13.527 3rd Qu.:68.72 3rd Qu.: 4.690
## Max. :31.208 Max. :90.34 Max. :18.092
## percent_children_in_poverty percent_single_parent_households
## Min. : 2.50 Min. : 9.43
## 1st Qu.:12.82 1st Qu.:27.07
## Median :18.40 Median :32.96
## Mean :19.46 Mean :33.83
## 3rd Qu.:24.50 3rd Qu.:38.93
## Max. :55.00 Max. :80.00
## percent_severe_housing_problems overcrowding percent_adults_with_diabetes
## Min. : 6.562 Min. : 0.000 Min. : 1.80
## 1st Qu.:12.267 1st Qu.: 1.379 1st Qu.: 9.10
## Median :14.439 Median : 1.971 Median :11.30
## Mean :15.082 Mean : 2.437 Mean :11.75
## 3rd Qu.:16.992 3rd Qu.: 2.887 3rd Qu.:13.90
## Max. :33.391 Max. :14.489 Max. :34.10
## percent_food_insecure percent_insufficient_sleep percent_uninsured_2
## Min. : 3.40 Min. :23.03 Min. : 2.683
## 1st Qu.:10.70 1st Qu.:31.42 1st Qu.: 7.865
## Median :12.70 Median :34.02 Median :12.027
## Mean :13.25 Mean :33.88 Mean :12.786
## 3rd Qu.:15.20 3rd Qu.:36.54 3rd Qu.:16.572
## Max. :33.50 Max. :46.71 Max. :42.397
## median_household_income average_traffic_volume_per_meter_of_major_roadways
## Min. : 25385 Min. : 0.00
## 1st Qu.: 46994 1st Qu.: 53.09
## Median : 54317 Median : 104.63
## Mean : 57600 Mean : 200.72
## 3rd Qu.: 64775 3rd Qu.: 206.78
## Max. :140382 Max. :4444.12
## percent_homeowners population_2 percent_less_than_18_years_of_age
## Min. :24.13 Min. : 2887 Min. : 7.069
## 1st Qu.:64.36 1st Qu.: 36275 1st Qu.:20.326
## Median :69.96 Median : 75382 Median :22.182
## Mean :68.99 Mean : 201689 Mean :22.204
## 3rd Qu.:74.77 3rd Qu.: 179982 3rd Qu.:24.019
## Max. :89.76 Max. :10105518 Max. :35.447
## percent_65_and_over percent_black percent_asian percent_hispanic
## Min. : 7.722 Min. : 0.1286 Min. : 0.06245 Min. : 0.7952
## 1st Qu.:14.913 1st Qu.: 1.6168 1st Qu.: 0.68228 1st Qu.: 2.9451
## Median :17.225 Median : 5.6397 Median : 1.22863 Median : 5.6100
## Mean :17.512 Mean :12.4056 Mean : 2.40009 Mean :10.0338
## 3rd Qu.:19.598 3rd Qu.:17.4904 3rd Qu.: 2.66813 3rd Qu.:11.1199
## Max. :57.587 Max. :81.9544 Max. :42.95231 Max. :96.3595
## percent_female percent_rural
## Min. :34.63 Min. : 0.00
## 1st Qu.:50.00 1st Qu.: 17.11
## Median :50.65 Median : 36.97
## Mean :50.46 Mean : 40.12
## 3rd Qu.:51.35 3rd Qu.: 60.04
## Max. :56.87 Max. :100.00
If any variables have missing values for many counties, we go back and remove those variables from consideration.
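One way to check this is to count missing values per column. The following is a minimal sketch on a made-up data frame; the column names `a`, `b`, `c` and the 50% threshold are assumptions for illustration only, not choices made in this assignment.

```r
# Toy data frame (hypothetical columns) with some missing values
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(NA, NA, NA, 4),
  c = c(1, 2, 3, 4)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Keep only columns whose share of missing values is at most 50%
# (the threshold is arbitrary and chosen only for this example)
keep <- names(na_counts)[na_counts / nrow(toy) <= 0.5]
toy_clean <- toy[, keep, drop = FALSE]
```

The same `colSums(is.na(...))` call can be applied to `county_data` to see which variables, if any, would need to be dropped.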
Let’s create a final data frame for analysis.
county_data <- county_data %>%
  mutate(state = as.factor(state)) %>%
  select(county, confirmed, deaths, state, percent_fair_or_poor_health:percent_rural)
summary(county_data)
## county confirmed deaths state
## Length:1446 Min. : 5.0 Min. : 0.000 Georgia : 96
## Class :character 1st Qu.: 9.0 1st Qu.: 0.000 Texas : 80
## Mode :character Median : 20.0 Median : 0.000 North Carolina: 63
## Mean : 207.2 Mean : 4.854 Mississippi : 61
## 3rd Qu.: 66.0 3rd Qu.: 2.000 Indiana : 58
## Max. :63306.0 Max. :1905.000 Ohio : 57
## (Other) :1031
## percent_fair_or_poor_health percent_smokers percent_adults_with_obesity
## Min. : 8.121 Min. : 5.909 Min. :12.40
## 1st Qu.:14.390 1st Qu.:14.899 1st Qu.:29.10
## Median :17.010 Median :17.143 Median :32.90
## Mean :17.594 Mean :17.151 Mean :32.39
## 3rd Qu.:20.398 3rd Qu.:19.365 3rd Qu.:36.20
## Max. :38.887 Max. :27.775 Max. :51.00
##
## percent_with_access_to_exercise_opportunities percent_excessive_drinking
## Min. : 0.00 Min. : 7.81
## 1st Qu.: 59.95 1st Qu.:15.68
## Median : 74.71 Median :18.03
## Mean : 71.15 Mean :17.91
## 3rd Qu.: 85.97 3rd Qu.:20.01
## Max. :100.00 Max. :28.62
##
## percent_uninsured percent_some_college percent_unemployed
## Min. : 2.263 Min. :21.14 Min. : 1.582
## 1st Qu.: 6.754 1st Qu.:53.21 1st Qu.: 3.252
## Median : 9.937 Median :61.19 Median : 3.870
## Mean :10.592 Mean :60.83 Mean : 4.071
## 3rd Qu.:13.527 3rd Qu.:68.72 3rd Qu.: 4.690
## Max. :31.208 Max. :90.34 Max. :18.092
##
## percent_children_in_poverty percent_single_parent_households
## Min. : 2.50 Min. : 9.43
## 1st Qu.:12.82 1st Qu.:27.07
## Median :18.40 Median :32.96
## Mean :19.46 Mean :33.83
## 3rd Qu.:24.50 3rd Qu.:38.93
## Max. :55.00 Max. :80.00
##
## percent_severe_housing_problems overcrowding percent_adults_with_diabetes
## Min. : 6.562 Min. : 0.000 Min. : 1.80
## 1st Qu.:12.267 1st Qu.: 1.379 1st Qu.: 9.10
## Median :14.439 Median : 1.971 Median :11.30
## Mean :15.082 Mean : 2.437 Mean :11.75
## 3rd Qu.:16.992 3rd Qu.: 2.887 3rd Qu.:13.90
## Max. :33.391 Max. :14.489 Max. :34.10
##
## percent_food_insecure percent_insufficient_sleep percent_uninsured_2
## Min. : 3.40 Min. :23.03 Min. : 2.683
## 1st Qu.:10.70 1st Qu.:31.42 1st Qu.: 7.865
## Median :12.70 Median :34.02 Median :12.027
## Mean :13.25 Mean :33.88 Mean :12.786
## 3rd Qu.:15.20 3rd Qu.:36.54 3rd Qu.:16.572
## Max. :33.50 Max. :46.71 Max. :42.397
##
## median_household_income average_traffic_volume_per_meter_of_major_roadways
## Min. : 25385 Min. : 0.00
## 1st Qu.: 46994 1st Qu.: 53.09
## Median : 54317 Median : 104.63
## Mean : 57600 Mean : 200.72
## 3rd Qu.: 64775 3rd Qu.: 206.78
## Max. :140382 Max. :4444.12
##
## percent_homeowners population_2 percent_less_than_18_years_of_age
## Min. :24.13 Min. : 2887 Min. : 7.069
## 1st Qu.:64.36 1st Qu.: 36275 1st Qu.:20.326
## Median :69.96 Median : 75382 Median :22.182
## Mean :68.99 Mean : 201689 Mean :22.204
## 3rd Qu.:74.77 3rd Qu.: 179982 3rd Qu.:24.019
## Max. :89.76 Max. :10105518 Max. :35.447
##
## percent_65_and_over percent_black percent_asian percent_hispanic
## Min. : 7.722 Min. : 0.1286 Min. : 0.06245 Min. : 0.7952
## 1st Qu.:14.913 1st Qu.: 1.6168 1st Qu.: 0.68228 1st Qu.: 2.9451
## Median :17.225 Median : 5.6397 Median : 1.22863 Median : 5.6100
## Mean :17.512 Mean :12.4056 Mean : 2.40009 Mean :10.0338
## 3rd Qu.:19.598 3rd Qu.:17.4904 3rd Qu.: 2.66813 3rd Qu.:11.1199
## Max. :57.587 Max. :81.9544 Max. :42.95231 Max. :96.3595
##
## percent_female percent_rural
## Min. :34.63 Min. : 0.00
## 1st Qu.:50.00 1st Qu.: 17.11
## Median :50.65 Median : 36.97
## Mean :50.46 Mean : 40.12
## 3rd Qu.:51.35 3rd Qu.: 60.04
## Max. :56.87 Max. :100.00
##
Display the 10 counties with the highest CFR.
county_data %>%
  mutate(cfr = deaths / confirmed) %>%
  select(county, state, confirmed, deaths, cfr) %>%
  arrange(desc(cfr)) %>%
  top_n(10)
## Selecting by cfr
## # A tibble: 18 x 5
## county state confirmed deaths cfr
## <chr> <fct> <dbl> <dbl> <dbl>
## 1 Emmet Michigan 7 2 0.286
## 2 Grand Traverse Michigan 12 3 0.25
## 3 Toole Montana 12 3 0.25
## 4 Fayette Indiana 14 3 0.214
## 5 Concordia Louisiana 5 1 0.2
## 6 Harrison Texas 5 1 0.2
## 7 Huntington Indiana 5 1 0.2
## 8 Isabella Michigan 10 2 0.2
## 9 McDuffie Georgia 5 1 0.2
## 10 Navarro Texas 5 1 0.2
## 11 Orange Indiana 5 1 0.2
## 12 Perry Pennsylvania 5 1 0.2
## 13 Randolph Indiana 5 1 0.2
## 14 Rockingham North Carolina 5 1 0.2
## 15 Seneca Ohio 5 1 0.2
## 16 Toombs Georgia 5 1 0.2
## 17 Vigo Indiana 10 2 0.2
## 18 Washington Alabama 5 1 0.2
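The tibble above has 18 rows rather than 10 because `top_n()` keeps every row tied at the cutoff, and many counties share a CFR of 0.2. The sketch below reproduces this behavior on a small made-up data set (the counties `A` to `E` are hypothetical) and shows one possible way to get an exact row count, assuming ties are broken by the number of confirmed cases.

```r
library(dplyr)

# Toy data (hypothetical, not the real county data): B, C, D tie at cfr = 0.2
toy <- tibble(
  county    = c("A", "B", "C", "D", "E"),
  confirmed = c(10, 5, 5, 10, 100),
  deaths    = c(3, 1, 1, 2, 1)
) %>%
  mutate(cfr = deaths / confirmed)

# top_n() keeps all rows tied at the cutoff, so asking for 2 returns 4 rows
with_ties <- toy %>% top_n(2, cfr)

# arrange() + head() returns exactly 2 rows; ties are broken by confirmed count
exactly_two <- toy %>%
  arrange(desc(cfr), desc(confirmed)) %>%
  head(2)
```

Naming the weighting column explicitly, as in `top_n(2, cfr)`, also avoids the `Selecting by cfr` message.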
Write the final data to a CSV file for future use.
write_csv(county_data, "covid19-county-data-20200404.csv.gz")
Read and run the above code to generate a data frame county_data that includes county-level COVID-19 confirmed cases and deaths, along with demographic and health-related information.
What assumptions of CFR might be violated by defining CFR as deaths/confirmed from this data set? Acknowledging these severe limitations, we continue to use deaths/confirmed as a very rough proxy for CFR.
What assumptions of logistic regression may be violated by this data set?
Run a logistic regression, using variables state, …, percent_rural as predictors.
Interpret the regression coefficients of 3 significant predictors with p-value < 0.01.
Apply analysis of deviance to (1) evaluate the goodness of fit of the model and (2) compare the model to the intercept-only model.
Perform analysis of deviance to evaluate the significance of each predictor. Display the 10 most significant predictors.
Construct confidence intervals of regression coefficients.
Plot the deviance residuals against the fitted values. Are there potential outliers?
Draw the half-normal plot. Are there potential outliers in predictor space?
Find the best sub-model using the AIC criterion.
Find the best sub-model using the lasso with cross validation.
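As orientation for the modeling questions above, here is a minimal sketch of a binomial logistic regression workflow in base R. It runs on simulated data, not on `county_data`: the predictors `pct_65` and `pct_rural`, the simulation coefficients, and the sample size are all assumptions made for illustration, and the calls shown are only one of several reasonable ways to approach each step.

```r
set.seed(2020)

# Simulated toy data (hypothetical; not the real county data)
n <- 300
toy <- data.frame(
  confirmed = rpois(n, 40) + 5,   # at least 5 cases per row
  pct_65    = runif(n, 8, 30),    # stand-in for percent_65_and_over
  pct_rural = runif(n, 0, 100)    # stand-in for percent_rural (no true effect)
)
# Assumed true model, used only to simulate the death counts
toy$deaths <- rbinom(n, toy$confirmed, plogis(-4 + 0.06 * toy$pct_65))

# Binomial response given as (successes, failures) = (deaths, survivors)
fit  <- glm(cbind(deaths, confirmed - deaths) ~ pct_65 + pct_rural,
            family = binomial, data = toy)
null <- update(fit, . ~ 1)

# Goodness of fit: residual deviance against chi-square on residual df
gof_p <- pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)

# Comparison with the intercept-only model (likelihood ratio test)
lrt <- anova(null, fit, test = "Chisq")

# Analysis of deviance for each predictor: drop one term at a time
per_term <- drop1(fit, test = "Chisq")

# Wald confidence intervals (confint() would give profile-likelihood ones)
ci <- confint.default(fit, level = 0.95)

# Deviance residuals against the fitted linear predictor
plot(predict(fit, type = "link"), residuals(fit, type = "deviance"),
     xlab = "Linear predictor", ylab = "Deviance residual")

# Stepwise sub-model search by AIC
best_aic <- step(fit, trace = 0)
```

On the real data the formula would instead use `state` through `percent_rural`, and the lasso step would need an additional package, which this sketch does not cover.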