QRMIAS: Eighth meeting

Quantitative Research Methods - Introduction to Applied Statistics

David Sichinava
December 6, 2019

Today's topic

  • Linear regression

Practical:

  • Todorov et al. (2005) “Inferences of competence from faces predict election outcomes.” Science, vol. 308 (10 June), pp. 1623–1626.

  • How can one predict election results based on facial appearance?

Variables:

Variable   Description
--------   -----------
congress   Session of Congress
year       Year of the election
state      State of the election
winner     Name of the winner
loser      Name of the runner-up
w.party    Party of the winner
l.party    Party of the loser
d.votes    Number of votes for the Democratic candidate
r.votes    Number of votes for the Republican candidate
d.comp     Competence measure for the Democratic candidate
r.comp     Competence measure for the Republican candidate

Data:

## Read your data
## Download it from here: https://goo.gl/xYSjCr
face <- read.csv("face.csv")
## two-party vote share for Democrats and Republicans
face$d.share <- face$d.votes / (face$d.votes + face$r.votes)
face$r.share <- face$r.votes / (face$d.votes + face$r.votes)
## margin: Democratic minus Republican two-party share
face$diff.share <- face$d.share - face$r.share
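
A quick look at the new variables confirms the shares behave as expected (head() and summary() are base R; this check is optional):

head(face[, c("d.share", "r.share", "diff.share")])
summary(face$diff.share) # margin lies between -1 and 1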

Correlation:

  • What is the correlation between perceived competence and vote share differential?
cor(face$d.comp, face$diff.share)
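
To go beyond the point estimate, base R's cor.test() also gives a significance test and a confidence interval for the correlation (an optional extra step):

cor.test(face$d.comp, face$diff.share) # Pearson correlation with t-test and CI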

Correlation:

library(ggplot2)
ggplot(face, aes(x = d.comp, y = diff.share)) +
    geom_point(aes(color = w.party)) +
    labs(title = "Facial competence and vote share",
         x = "Competence score for the Democratic candidate",
         y = "Vote share margin (Democrat - Republican)") +
    scale_color_manual(name = "Winning party",
                       values = c("blue", "red"))

Linear relationship:

\( Y = \alpha + \beta X + \epsilon \), where \( \alpha \) is the intercept, \( X \) the independent (explanatory) variable, \( \beta \) the regression coefficient (slope), and \( \epsilon \) the error term.

Linear relationship:

  • “All models are wrong, but some are useful” (George Box);
  • Our goal is to approximate the data-generating process;
  • Since the true parameters are unknown, every model reports estimated coefficients (see the sketch below).
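
As a minimal sketch of what “data-generating process” means here, we can simulate data from the model above and recover the parameters; the values of alpha and beta below are made up purely for illustration:

set.seed(42)      # reproducibility
n     <- 100
alpha <- 0.5      # "true" intercept (assumed)
beta  <- 2        # "true" slope (assumed)
x     <- runif(n) # explanatory variable
y     <- alpha + beta * x + rnorm(n, sd = 0.3) # Y = alpha + beta*X + epsilon
coef(lm(y ~ x))   # estimates should land close to 0.5 and 2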

Linear relationship:

\( \hat{Y} = \hat{\alpha}+\hat{\beta} x \), where \( x \) is some value of \( X \);

Linear relationship:

\( \hat{\epsilon} = Y-\hat{Y} \), where \( \hat{\epsilon} \) is the residual: the difference between the observed and the fitted value;

Linear relationship:

fit <- lm(diff.share ~ d.comp, data = face) # fit the model
fit # print the estimated coefficients
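
With the model in hand, \( \hat{\alpha} \) and \( \hat{\beta} \) can be extracted with coef(), and \( \hat{Y} = \hat{\alpha}+\hat{\beta} x \) computed by hand or with predict(); the names a.hat and b.hat and the value 0.5 below are chosen for illustration:

coef(fit)             # alpha-hat and beta-hat
a.hat <- coef(fit)[1] # illustrative names
b.hat <- coef(fit)[2]
a.hat + b.hat * 0.5   # predicted margin for d.comp = 0.5, by hand
predict(fit, newdata = data.frame(d.comp = 0.5)) # same, via predict()
head(resid(fit))      # residuals: observed minus fitted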

Regression plot:

ggplot(face, aes(x = d.comp, y = diff.share)) +
    geom_point(aes(color = w.party)) +
    geom_smooth(method = 'lm') +
    labs(title = "Facial competence and vote share",
         x = "Competence score for the Democratic candidate",
         y = "Vote share margin (Democrat - Republican)") +
    scale_color_manual(name = "Winning party",
                       values = c("blue", "red"))

Least Squares Method

  • The best-fitting theoretical line
  • We select the parameters that produce the least squared errors
  • Root-mean-squared error:
    • \( RMSE = \sqrt{\frac{1}{n}SSR} \), where \( SSR = \sum_{i=1}^{n}\hat{\epsilon}_i^2 \) is the sum of squared residuals

Least Squares Method

epsilon.hat <- resid(fit) # Residuals
sqrt(mean(epsilon.hat^2)) # RMSE

Coefficients:

\( \hat{\alpha}=\bar{Y}-\hat{\beta}\bar{X} \)
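
Assuming the data have no missing values, this identity can be verified directly against the fitted model:

mean(face$diff.share) - coef(fit)[2] * mean(face$d.comp) # alpha-hat from means
coef(fit)[1] # should match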

Beta-coefficients:

\( \hat{\beta} = \mathrm{corr}(X, Y) \times \frac{SD(Y)}{SD(X)} \)
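
The same kind of check works for the slope (again assuming complete data):

cor(face$d.comp, face$diff.share) * sd(face$diff.share) / sd(face$d.comp)
coef(fit)[2] # should match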

Assumptions: no outliers

library(car) # Companion to Applied Regression: provides the diagnostics below
outlierTest(fit) # Bonferroni-adjusted test of the largest studentized residual
qqPlot(fit, main="QQ Plot") # studentized residuals vs. t quantiles
leveragePlots(fit) # leverage plots

Assumptions: no influential observations

# added-variable plots
avPlots(fit)
# Cook's distance plot
# identify D values > 4/(n-k-1), where k = number of predictors
cutoff <- 4/(nrow(face) - length(fit$coefficients)) # n - (k+1) = n-k-1
plot(fit, which=4, cook.levels=cutoff)
# influence plot
influencePlot(fit, id.method="identify", main="Influence Plot",
              sub="Circle size is proportional to Cook's distance")

Assumptions: normality of residuals

  • MASS library (for studres())
# normality of residuals
# QQ plot for studentized residuals
qqPlot(fit, main="QQ Plot")
# distribution of studentized residuals
library(MASS)
sresid <- studres(fit)
hist(sresid, freq=FALSE,
     main="Distribution of Studentized Residuals")
xfit <- seq(min(sresid), max(sresid), length=40)
yfit <- dnorm(xfit)
lines(xfit, yfit) # overlay the standard normal density
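
A formal test can back up the visual check; the Shapiro-Wilk test ships with base R (an optional addition to the code above):

shapiro.test(sresid) # H0: studentized residuals are normally distributed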

Assumptions: homoscedasticity

# Evaluate homoscedasticity
# non-constant error variance test
ncvTest(fit)
# plot studentized residuals vs. fitted values 
spreadLevelPlot(fit)

Assumptions: no multicollinearity

vif(fit) # variance inflation factors (requires 2+ predictors; see note below)
sqrt(vif(fit)) > 2 # rule of thumb: TRUE flags a potential problem
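
Note that vif() is only defined for models with at least two predictors, so it errors on the bivariate fit above. A hypothetical two-predictor model (fit2 is an illustrative name; r.comp is in the data) shows the intended use:

fit2 <- lm(diff.share ~ d.comp + r.comp, data = face) # illustration only
vif(fit2) # one VIF per predictor
sqrt(vif(fit2)) > 2 # rule-of-thumb flag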

Assumptions: no non-linearity

# Evaluate Nonlinearity
# component + residual plot 
crPlots(fit)
# Ceres plots 
ceresPlots(fit)

Assumptions: independence of observations

durbinWatsonTest(fit) # tests for serial correlation in the residuals