Election Night Prediction Modeling using PL/R in Postgres

Joe Conway
PostgreSQL

I was sent a link to a tweet regarding election night forecasting using R, and of course the default question was ... could it be run under PL/R inside Postgres? Like almost everything at Crunchy Data, we believe all things are better with Postgres. So I decided to give it a shot, and a bit of a database spin as it were. Since I had to get this blog done quickly, it is going to be mostly code -- sorry about that!  

The code in this blog (please see a small but important correction at the end) derived from the GitHub gist mentioned in the tweet. I modified it to work within PL/R and show how to save the input and corresponding results to a PostgreSQL table.

Before attempting to run this, install the required libraries either as root or the user under which PostgreSQL runs on your database server (usually "postgres"):

install.packages(c("tidyverse","mvtnorm","politicaldata"))
First step, create plr_modules table. This is described in the PL/R User Guide.
CREATE TABLE plr_modules
(
  modseq int4,
  modsrc text
);
Next we will insert much of the core code from the gist as a row in that table. It has been slightly modified from the original, but as far as I can tell still produces that same results. It will be loaded into the R environment that PL/R sees when you start your SQL session:
INSERT INTO plr_modules VALUES (42, $m$
library(tidyverse)
library(mvtnorm)
library(politicaldata)

# functions first ---------------------------------------------------------
logit <- function(x) log(x/(1-x))
inv_logit <- function(x) 1/(1 + exp(-x))
draw_samples <- function(biden_states = NULL, trump_states = NULL, states = NULL, 
                         upper_biden = NULL, lower_biden = NULL, print_acceptance = FALSE, target_nsim = 1000){
  sim <- matrix(NA, nr = 1, nc = length(mu))
  n <- 0
  while(nrow(sim) < target_nsim){
    # randomly sample from the posterior distribution and reject when constraints are not met
    n <- n + 1
    proposals <- inv_logit(rmvnorm(1e5, mu, Sigma, method = "svd")) # "DC" is pretty much uncorrelated
    colnames(proposals) <- names(mu)
    if (!is.null(biden_states)) proposals[which(proposals[,biden_states] < .5)] <- NA
    if (!is.null(  trump_states)) proposals[which(proposals[,  trump_states] > .5)] <- NA
    if (!is.null(        states)){
      for (s in states){
        proposals[which(proposals[, s] > upper_biden[s] | 
                          proposals[, s] < lower_biden[s])] <- NA
      }
    }
    reject <- apply(proposals, 1, function(x) any(is.na(x)))
    sim <- rbind(sim, proposals[!reject,])
    if (nrow(sim) < target_nsim & nrow(sim)/(nrow(proposals)*n) < 1-99.99/100){
      stop(paste("rmvnorm() is working hard... but more than 99.99% of the samples are rejected; you should relax some contraints.", sep = ""))
    }
  }
  return(list("matrix" = sim[-1,], "acceptance_rate" = nrow(sim)/(nrow(proposals)*n)))
}
update_prob <- function(biden_states = NULL,
                        trump_states = NULL,
                        bsl_states = NULL,
                        bsl_lower = NULL,
                        bsl_upper = NULL,
                        target_nsim = 1000){
  states <- bsl_states
  lower_biden <- bsl_lower/100
  upper_biden <- bsl_upper/100
  sim <- draw_samples(biden_states = biden_states, trump_states = trump_states, states = states, 
                      upper_biden = upper_biden, lower_biden = lower_biden, 
                      target_nsim = target_nsim)
  ev_dist <- (sim[["matrix"]] > .5) %*% ev
  state_win <- colMeans(sim[["matrix"]] > .5)
  p <- mean(ev_dist >= 270)
  sd <- sqrt(p*(1-p)/length(ev_dist))
  bwbs <- t(round(100*state_win,1))
  return(data.frame(bwbs, round(100*p,1), length(ev_dist), round(sd*100,1)))
}

# read data ---------------------------------------------------------------
# get simulations
sim_forecast <- read_csv('https://cdn.economistdatateam.com/us-2020-forecast/data/president/electoral_college_simulations.csv') 

# check initial parameters
nrow(sim_forecast)
mean(sim_forecast$dem_ev > 269)

# select relevant columns and make the data frae into a matrix
sim_forecast <- sim_forecast %>% select(4:ncol(.))
sim_forecast <- list(sim_forecast,sim_forecast,sim_forecast,sim_forecast,sim_forecast) %>% bind_rows  %>% as.matrix


# this bit is really hacky
# it make the simulations a little less correlated by add a bunch of random noise
# this helps our live model react to really implausible events that could happen on election night
# but will have the result of pushing the initial pre-results forecasts back toward 50-50
sim_forecast <- sim_forecast + 
                rnorm(nrow(sim_forecast), 0, 0.01) +  # national error component
                replicate(ncol(sim_forecast),rnorm(nrow(sim_forecast),0,0.02)) # state

sim_forecast <- ifelse(sim_forecast <= 0, 0.0001, sim_forecast)
sim_forecast <- ifelse(sim_forecast >= 1, 0.99999, sim_forecast)

sim_forecast <- as_tibble(sim_forecast)

# now, get electoral votes in each state 
# and make sure they're in the right order
#ev_state <- read_csv('data/state_evs.csv')$ev
#names(ev_state) <- read_csv('data/state_evs.csv')$state
ev_state <- read_csv('https://raw.githubusercontent.com/TheEconomist/us-potus-model/master/data/2012.csv')$ev
names(ev_state) <- read_csv('https://raw.githubusercontent.com/TheEconomist/us-potus-model/master/data/2012.csv')$state
ev <- ev_state[colnames(sim_forecast)] 

# check that the EVs and state forecasts add up to the right amounts
sim_evs <- apply(sim_forecast,
      1,
      function(x){
        sum((x > 0.5 )* ev)
      })

#hist(sim_evs,breaks=538) 
median(sim_evs)
mean(sim_evs)
mean(sim_evs > 269)
enframe(prop.table(table(sim_evs)),'dem_ev','pct') %>% arrange(desc(pct))


# adding ME1 and ME2, NE1 NE2 to sim_forecast matrix and ev vector
# we do this by adding the average district-level two-party dem presidential vote, relative
# to the state-level dem two-party vote, back to to our state-level forecast

# first, split up EVs
ev["ME"] <- 2
ev["NE"] <- 2
ev <- c(ev, "ME1" = 1, "ME2" = 1, "NE1" = 1, "NE2" = 1, "NE3" = 1)
sum(ev)

# create simulations for ME and NE districts
me_ne_leans <- politicaldata::pres_results_by_cd %>% filter(year >= 2012, state_abb %in% c("ME","NE")) %>%
  select(-other) %>% 
  rename(state = state_abb) %>%
  group_by(year,state) %>%
  mutate(sum_pct = dem + rep,
         total_votes = total_votes * sum_pct,
         dem = dem /sum_pct,
         rep = rep/sum_pct) %>%
  mutate(dem_vote_state = sum(dem * total_votes) / sum(total_votes)) %>%
  mutate(dem_cd_lean = dem - dem_vote_state) %>%
  group_by(state,district) %>%
  summarise(dem_cd_lean = weighted.mean(dem_cd_lean, c(0.3,0.7)))

# bind new simulation columns for the congressional districts, based on the above
sim_forecast <- bind_cols(
  sim_forecast, 
  tibble(
    ME1 = sim_forecast %>% pull(ME) +
      rnorm(nrow(sim_forecast), 
            me_ne_leans[me_ne_leans$state == "ME" & me_ne_leans$district == 1,]$dem_cd_lean, 
            .0075),
    ME2 = sim_forecast %>% pull(ME) +
      rnorm(nrow(sim_forecast), 
            me_ne_leans[me_ne_leans$state == "ME" & me_ne_leans$district == 2,]$dem_cd_lean, 
            .0075),
    
    NE1 = sim_forecast %>% pull(NE) +
      rnorm(nrow(sim_forecast), 
            me_ne_leans[me_ne_leans$state == "NE" & me_ne_leans$district == 1,]$dem_cd_lean, 
            .0075),
    NE2 = sim_forecast %>% pull(NE) +
      rnorm(nrow(sim_forecast), 
            me_ne_leans[me_ne_leans$state == "NE" & me_ne_leans$district == 2,]$dem_cd_lean, 
            .0075),
    NE3 = sim_forecast %>% pull(NE) +
      rnorm(nrow(sim_forecast), 
            me_ne_leans[me_ne_leans$state == "NE" & me_ne_leans$district == 3,]$dem_cd_lean, 
            .0075) 
  )
)

sim_forecast


sim_evs <- apply(sim_forecast,
                 1,
                 function(x){
                   sum((x > 0.5 )* ev)
                 })

colMeans(sim_forecast > 0.5)

#hist(sim_evs,breaks=538)
median(sim_evs)
mean(sim_evs)
mean(sim_evs > 269)
enframe(prop.table(table(sim_evs)),'dem_ev','pct') %>% arrange(desc(pct))


# final data wrangling -- this stuff getes passed into the update_prob function
# mainly we just want to make sure everything is in the right order with the right names
ev <- ev[colnames(sim_forecast)]

Sigma <- cov(logit(sim_forecast)) 
Sigma

mu <- colMeans(logit(sim_forecast))
names(mu) <- colnames(sim_forecast)
$m$);
Now we need to either log out and back in, or simply reload the modules table into our environment:
SELECT reload_plr_modules();
Create a table for valid state codes:
CREATE TABLE states (code char(2) primary key);
INSERT INTO states VALUES
('AK'),('AL'),('AR'),('AZ'),('CA'),('CO'),('CT'),('DC'),('DE'),
('FL'),('GA'),('HI'),('IA'),('ID'),('IL'),('IN'),('KS'),('KY'),
('LA'),('MA'),('MD'),('ME'),('MI'),('MN'),('MO'),('MS'),('MT'),
('NC'),('ND'),('NE'),('NH'),('NJ'),('NM'),('NV'),('NY'),('OH'),
('OK'),('OR'),('PA'),('RI'),('SC'),('SD'),('TN'),('TX'),('UT'),
('VA'),('VT'),('WA'),('WI'),('WV'),('WY');
Then create a table to hold the inputs to the update_prob() function.
CREATE TABLE states_input
(
  code char(2) primary key REFERENCES states,
  called_for text check(called_for in ('biden','trump')),
  biden_scores_list float8[] check(array_length(biden_scores_list,1) = 2),
  check(called_for is null or biden_scores_list is null)
);
The way this is used is that as each state gets called for Biden or Trump, insert a row for that state using the appropriate two character state code (in all caps), and who it was called for (in lower case). You can also create a row where the called_for column is NULL but biden_scores_list is an array with lower and upper bounds on the Biden percentages. Now we create a special type for the output of the update_prob() function.
CREATE type results_out AS
(
  ak FLOAT8, al FLOAT8, ar FLOAT8, az FLOAT8, ca FLOAT8, co FLOAT8,
  ct FLOAT8, dc FLOAT8, de FLOAT8, fl FLOAT8, ga FLOAT8, hi FLOAT8,
  ia FLOAT8, id FLOAT8, il FLOAT8, "in" FLOAT8, ks FLOAT8, ky FLOAT8,
  la FLOAT8, ma FLOAT8, md FLOAT8, me FLOAT8, mi FLOAT8, mn FLOAT8,
  mo FLOAT8, ms FLOAT8, mt FLOAT8, nc FLOAT8, nd FLOAT8, ne FLOAT8,
  nh FLOAT8, nj FLOAT8, nm FLOAT8, nv FLOAT8, ny FLOAT8, oh FLOAT8,
  ok FLOAT8, "or" FLOAT8, pa FLOAT8, ri FLOAT8, sc FLOAT8, sd FLOAT8,
  tn FLOAT8, tx FLOAT8, ut FLOAT8, va FLOAT8, vt FLOAT8, wa FLOAT8,
  wi FLOAT8, wv FLOAT8, wy FLOAT8, me1 FLOAT8, me2 FLOAT8, ne1 FLOAT8,
  ne2 FLOAT8, ne3 FLOAT8, pr_biden_wins_pct FLOAT8, nsim INTEGER, se FLOAT8
);
And the update_prob() function itself.
CREATE OR REPLACE FUNCTION update_prob
(target_nsim integer = 1000) RETURNS results_out AS
$$
 conn <- dbConnect()

 sql_b <- "SELECT code FROM states_input WHERE called_for = 'biden'"
 sql_t <- "SELECT code FROM states_input WHERE called_for = 'trump'"
 sql_s <- "SELECT code, biden_scores_list[1] AS sll, biden_scores_list[2] AS slu FROM states_input WHERE biden_scores_list IS NOT NULL"

 bs <- dbGetQuery(conn, sql_b)
 biden_states <- bs$code

 ts <- dbGetQuery(conn, sql_t)
 trump_states <- ts$code

 bsl <- dbGetQuery(conn, sql_s)
 bsl_states <- bsl$code
 bsl_lower <- bsl$sll
 bsl_upper <- bsl$slu

 update_prob(biden_states,
             trump_states,
             bsl_states,
             bsl_lower,
             bsl_upper,
             target_nsim)
$$ LANGUAGE plr;
Now, finally we can see it in action:
-- raw prob, no constraints
TRUNCATE TABLE states_input;
\x auto
SELECT * FROM update_prob();
-[ RECORD 1 ]-----+-------
ak                | 11.6
al                | 0.1
ar                | 0
az                | 65.6
ca                | 100
co                | 97.5
ct                | 100
dc                | 98.7
de                | 99.9
fl                | 69.2
ga                | 53.9
hi                | 100
ia                | 43.6
id                | 0
il                | 99.8
in                | 1.2
ks                | 2.3
ky                | 0
la                | 1.1
ma                | 100
md                | 100
me                | 98.1
mi                | 90.5
mn                | 93
mo                | 6.5
ms                | 2.8
mt                | 6
nc                | 61.2
nd                | 0
ne                | 0.6
nh                | 91.7
nj                | 99.9
nm                | 97.4
nv                | 86.2
ny                | 100
oh                | 42.1
ok                | 0
or                | 99.8
pa                | 84.8
ri                | 100
sc                | 8.5
sd                | 0.2
tn                | 0.3
tx                | 34.6
ut                | 0.4
va                | 96.7
vt                | 100
wa                | 100
wi                | 89.6
wv                | 0
wy                | 0
me1               | 100
me2               | 54.5
ne1               | 3.7
ne2               | 77.9
ne3               | 0
pr_biden_wins_pct | 92
nsim              | 100000
se                | 0.1

-- For the sake of brevity, not going to paste
-- the query results for the following examples.
-- Left as an exercise for the reader.

-- biden wins florida
TRUNCATE TABLE states_input;
INSERT INTO states_input VALUES
('FL', 'biden', NULL);
SELECT * FROM update_prob();

-- trump wins florida
TRUNCATE TABLE states_input;
INSERT INTO states_input VALUES
('FL', 'trump', NULL);
SELECT * FROM update_prob();

-- trump wins fl and nc, biden wins va,
-- biden margin in MI can't be outside -10 or 10
TRUNCATE TABLE states_input;
INSERT INTO states_input VALUES
('VA', 'biden', NULL),
('FL', 'trump', NULL),
('NC', 'trump', NULL),
('MI',  NULL, ARRAY[45,55]);
SELECT * FROM update_prob();
And final-final thought -- if you want to preserve the input assumptions and corresponding results as the evening unfolds, you can do it like so:
CREATE TABLE results(inp states_input[], res results_out);
INSERT INTO results
 SELECT ARRAY(SELECT states_input FROM states_input), p
 FROM update_prob() AS p;

Here's to active results watching on election night!

CORRECTION: As is often the case, domain expertise is needed to ensure correct data modeling, and I for one lack domain expertise when it comes to political analysis. As such, I failed to accommodate the fact that Nebraska and Maine split their electoral votes up into multiple districts. Here is at least one way to fix the foregoing code to allow for that reality:

ALTER TABLE states ALTER COLUMN code TYPE varchar(3);
COPY results TO '/tmp/results.out';
DROP TABLE results;
ALTER TABLE states_input ALTER COLUMN code TYPE varchar(3);
CREATE TABLE results(inp states_input[], res results_out);
COPY results FROM '/tmp/results.out';
INSERT INTO states VALUES ('ME1'),('ME2'),('NE1'),('NE2'),('NE3');

Join the Discussion

Newsletter