7.1 Running models
#Only first variables from each category
model1 <- lm(pctGOP~median_rent+median_age+median_income+perc_hs, data = metadata)
#Only second variables from each category
model2 <- lm(pctGOP~perc_rent+perc_white+perc_below_pov+perc_doc, data = metadata)
#All variables
model3 <- lm(pctGOP~median_rent+median_age+median_income+perc_hs+perc_rent+perc_white+perc_below_pov+perc_doc, data = metadata)
#Only perc_rent, perc_below_pov, perc_hs
model4 <- lm(pctGOP~perc_rent+perc_below_pov+perc_hs, data = metadata)

stargazer(model1, model2, model3, model4, 
          type = "html", 
          report=('vc*p'),
          keep.stat = c("n","rsq","adj.rsq"), 
          notes = "<em>*p<0.1;**p<0.05;***p<0.01</em>", 
          notes.append = FALSE,
          model.numbers = FALSE, 
          column.labels = c("(1)","(2)", "(3)", "(4)"))

| Dependent variable: pctGOP | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
| median_rent | -0.0005*** | | -0.0002*** | |
| | p = 0.000 | | p = 0.000 | |
| median_age | 0.004*** | | -0.003*** | |
| | p = 0.000 | | p = 0.000 | |
| median_income | 0.00001*** | | -0.00000*** | |
| | p = 0.000 | | p = 0.001 | |
| perc_hs | 0.003*** | | 0.003*** | 0.005*** |
| | p = 0.00000 | | p = 0.000 | p = 0.000 |
| perc_rent | | -0.010*** | -0.008*** | -0.022*** |
| | | p = 0.000 | p = 0.000 | p = 0.000 |
| perc_white | | 0.005*** | 0.005*** | |
| | | p = 0.000 | p = 0.000 | |
| perc_below_pov | | 0.005*** | -0.004*** | -0.002*** |
| | | p = 0.000 | p = 0.000 | p = 0.00002 |
| perc_doc | | -0.058*** | -0.036*** | |
| | | p = 0.000 | p = 0.000 | |
| Constant | 0.632*** | 0.298*** | 0.783*** | 0.875*** |
| | p = 0.000 | p = 0.000 | p = 0.000 | p = 0.000 |
| Observations | 3,108 | 3,108 | 3,108 | 3,108 |
| R² | 0.321 | 0.531 | 0.614 | 0.214 |
| Adjusted R² | 0.320 | 0.530 | 0.613 | 0.213 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | | | |

htmlTable(round(cor(metadata[,c("median_rent", "median_age", "median_income", "perc_hs", "perc_rent", "perc_white", "perc_below_pov", "perc_doc")]), digits = 3),
          caption = 'Multicollinearity Test:',
          css.cell = 'padding: 0px 2px 0px; font-size: 10px;',
          css.header = "font-size: 10px; font-weight: normal;")

Multicollinearity Test:

| | median_rent | median_age | median_income | perc_hs | perc_rent | perc_white | perc_below_pov | perc_doc |
|---|---|---|---|---|---|---|---|---|
| median_rent | 1 | -0.237 | 0.666 | -0.294 | 0.187 | -0.168 | -0.394 | 0.453 | 
| median_age | -0.237 | 1 | -0.034 | -0.185 | -0.331 | 0.345 | -0.195 | -0.25 | 
| median_income | 0.666 | -0.034 | 1 | -0.576 | -0.039 | 0.164 | -0.746 | 0.308 | 
| perc_hs | -0.294 | -0.185 | -0.576 | 1 | -0.008 | -0.309 | 0.601 | -0.318 | 
| perc_rent | 0.187 | -0.331 | -0.039 | -0.008 | 1 | -0.333 | 0.258 | 0.32 | 
| perc_white | -0.168 | 0.345 | 0.164 | -0.309 | -0.333 | 1 | -0.466 | -0.073 | 
| perc_below_pov | -0.394 | -0.195 | -0.746 | 0.601 | 0.258 | -0.466 | 1 | -0.128 | 
| perc_doc | 0.453 | -0.25 | 0.308 | -0.318 | 0.32 | -0.073 | -0.128 | 1 | 
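
Pairwise correlations alone cannot rule out an exact linear relationship among three or more regressors, so a rank check on the design matrix is a useful complement to the matrix above. The following is a minimal sketch under stated assumptions: `toy` is a simulated stand-in for `metadata` (its values are invented and are not the county data), and `vif_manual` is a hypothetical helper written here for illustration, not a function from any package.

```r
# Toy stand-in for `metadata` (hypothetical values, NOT the county data).
set.seed(1)
n <- 200
toy <- data.frame(
  median_income = rnorm(n, mean = 50, sd = 10),
  perc_rent     = rnorm(n, mean = 30, sd = 8)
)
# Built to correlate strongly, but not perfectly, with median_income.
toy$perc_below_pov <- 40 - 0.5 * toy$median_income + rnorm(n, sd = 3)
toy$pctGOP <- 0.6 - 0.005 * toy$perc_rent + rnorm(n, sd = 0.05)

vars <- c("median_income", "perc_rent", "perc_below_pov")
fit  <- lm(pctGOP ~ median_income + perc_rent + perc_below_pov, data = toy)

# 1) Perfect multicollinearity means the design matrix lacks full column rank.
X <- model.matrix(fit)
full_rank <- qr(X)$rank == ncol(X)

# 2) Largest absolute pairwise correlation among the regressors.
r <- cor(toy[, vars])
max_abs_r <- max(abs(r[upper.tri(r)]))

# 3) Variance inflation factor: 1 / (1 - R^2_j), where R^2_j comes from
#    regressing regressor j on all the other regressors.
vif_manual <- function(data, vars) {
  sapply(vars, function(v) {
    others <- setdiff(vars, v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}
vifs <- vif_manual(toy, vars)

print(full_rank)
print(round(max_abs_r, 3))
print(round(vifs, 2))
```

On the real data, the same three steps would be run with `metadata` and the eight regressors in place of `toy` and `vars`. A full-rank design matrix confirms the absence of perfect multicollinearity; the correlations and variance inflation factors only indicate how strong the imperfect collinearity is.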
To interpret our estimates accurately, we need to check four assumptions: linearity in parameters, independent and random sampling, no perfect multicollinearity, and zero conditional mean. To assess the first, we compared the linear fit on each scatter plot with a nonlinear fit (using method = "loess"). For most of the variables (excluding perc_doc), the nonlinear curves appear to overfit the data rather than reveal a clear departure from linearity, so it is inconclusive whether the first assumption is violated. Because our data come from the census and are used as a cross section, we treat them as a random sample, which meets our second assumption. In addition, the multicollinearity test shows no perfect multicollinearity (the largest correlation in magnitude is -0.746, between median_income and perc_below_pov), so our third assumption also holds. However, we cannot assume that the zero conditional mean assumption is met in these models. Many factors plausibly influence the percent of GOP votes, and our eight variables cannot feasibly control for all of them. For instance, variables such as % Male or % with kids might have an impact that we are unable to represent.
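
For the linearity check described above, one way to make the comparison between a linear fit and a loess fit concrete is to evaluate both on a common grid and measure how far apart they are. This is only a sketch under assumptions: it uses base R's `loess()` rather than a plotting layer, the data frame `toy` is simulated (not the county data), and the span of 0.75 is simply the `loess()` default written out.

```r
# Toy data (hypothetical): a truly linear relationship plus noise.
set.seed(42)
n <- 300
toy <- data.frame(perc_rent = runif(n, min = 5, max = 60))
toy$pctGOP <- 0.9 - 0.008 * toy$perc_rent + rnorm(n, sd = 0.08)

lin_fit   <- lm(pctGOP ~ perc_rent, data = toy)
loess_fit <- loess(pctGOP ~ perc_rent, data = toy, span = 0.75)

# Evaluate both fits on a common grid inside the observed range.
grid <- data.frame(
  perc_rent = seq(min(toy$perc_rent), max(toy$perc_rent), length.out = 100)
)
pred_lin   <- predict(lin_fit, newdata = grid)
pred_loess <- predict(loess_fit, newdata = grid)

# Largest vertical gap between the two curves, next to the residual
# standard error of the linear model for scale.
max_gap  <- max(abs(pred_lin - pred_loess))
resid_sd <- summary(lin_fit)$sigma
print(c(max_gap = max_gap, resid_sd = resid_sd))
```

A gap that is small relative to the residual standard error suggests that the loess curve is mostly tracing noise; a large, systematic gap would point toward a genuine nonlinearity. The threshold for "small" is a judgment call, which is consistent with the inconclusive reading above.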
Given that the zero conditional mean assumption likely does not hold, we should not interpret any of the results as causal.
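
A small simulation can illustrate why a violated zero conditional mean assumption rules out causal readings. In the sketch below, all variables are simulated and the names `x`, `z`, and `y` are hypothetical: `z` plays the role of an unobserved factor that is correlated with the regressor `x` and also affects the outcome `y`. The true effect of `x` is set to +0.5 by construction.

```r
# Simulated data (hypothetical): z is an omitted driver correlated with x.
set.seed(7)
n <- 5000
z <- rnorm(n)
x <- 0.8 * z + rnorm(n)                 # x is correlated with z
y <- 1 + 0.5 * x - 2.0 * z + rnorm(n)   # true effect of x on y is +0.5

short_fit <- lm(y ~ x)      # omits z, so z ends up in the error term
long_fit  <- lm(y ~ x + z)  # controls for z

b_short <- unname(coef(short_fit)["x"])
b_long  <- unname(coef(long_fit)["x"])
print(round(c(short = b_short, long = b_long), 3))
```

In this setup the short regression's slope converges to 0.5 - 2 × 0.8 / 1.64 ≈ -0.48, so it has the wrong sign, while the long regression recovers a value near 0.5. The regression table shows a comparable pattern for perc_below_pov, whose coefficient changes sign between models (2) and (3) once more controls are added, although the simulation does not establish what drives that change in the real data.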