Resources

Correcting for Spatial & Temporal Auto-Correlation in Panel Data

Darin Christensen and Thiemo Fetzer


tl;dr: Fast computation of standard errors that allows for serial and spatial auto-correlation.


Economists and political scientists often employ panel data that track units (e.g., firms or villages) over time. When estimating regression models using such data, we often need to be concerned about two forms of auto-correlation: serial (within units over time) and spatial (across nearby units). As Cameron and Miller (2013) note in their excellent guide to cluster-robust inference, failure to account for such dependence can lead to incorrect conclusions: “[f]ailure to control for within-cluster error correlation can lead to very misleadingly small standard errors…” (p. 4).

Conley (1999, 2008) develops one commonly employed solution. His approach allows for serial correlation over all (or a specified number of) time periods, as well as spatial correlation among units that fall within a certain distance of each other. For example, we can account for correlated disturbances within a particular village over time, as well as between that village and every other village within one hundred kilometers.
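To fix ideas, the sketch below builds the "meat" of this sandwich estimator by brute force, assuming a uniform spatial kernel within the distance cutoff (applied to same-period pairs) and a Bartlett kernel in time (applied within units) – the options used in the example below. The helper name toy_conley_meat and its arguments are ours, purely for illustration; they are not the interface of the function we introduce next.

library(geosphere) # distHaversine()

toy_conley_meat <- function(X, e, unit, time, lat, lon, dist_cutoff, lag_cutoff) {
  X <- as.matrix(X)
  meat <- matrix(0, ncol(X), ncol(X))
  for (i in seq_len(nrow(X))) {
    d_km <- distHaversine(c(lon[i], lat[i]), cbind(lon, lat)) / 1000
    lag <- abs(time - time[i])
    w <- ifelse(time == time[i],
      as.numeric(d_km <= dist_cutoff), # same period: spatial weight 1 within the cutoff
      (unit == unit[i]) * pmax(0, 1 - lag / (lag_cutoff + 1))) # same unit: Bartlett weight in time
    meat <- meat + tcrossprod(e[i] * X[i, ], colSums(w * e * X))
  }
  meat
}

The corrected covariance matrix is then sandwich-shaped, solve(crossprod(X)) %*% meat %*% solve(crossprod(X)), where X collects the regressors and e the residuals. Looping over every pair of observations like this is exactly what makes a fast implementation worthwhile.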

We provide a new function that allows R users to more easily estimate these corrected standard errors. (Solomon Hsiang (2010) provides code for STATA, which we used to test our estimates and benchmark speed.) Moreover, using the excellent lfe, Rcpp, and RcppArmadillo packages (and Tony Fischetti’s Haversine distance function), our function is roughly 20 times faster than the STATA equivalent and can scale to handle panels with more units. (We have used it on panel data with over 100,000 units observed over 6 years.)

This demonstration employs data from Fetzer (2014), who uses a panel of U.S. counties from 1999-2012. The data and code can be downloaded here.


STATA Code:

We first use Hsiang’s STATA code to compute the corrected standard errors (spatHAC in the output below). This routine takes roughly 25 seconds.

cd "~/Dropbox/ConleySEs/Data"
clear
use "new_testspatial.dta"

tab year, gen(yy_)
tab FIPS, gen(FIPS_)

timer on 1
ols_spatial_HAC EmpClean00 HDD yy_* FIPS_2-FIPS_362, lat(lat ) lon(lon ) t(year) p(FIPS) dist(500) lag(5) bartlett disp

# *-----------------------------------------------
# *    Variable |   OLS      spatial    spatHAC
# *-------------+---------------------------------
# *         HDD |   -0.669     -0.669     -0.669
# *             |    0.608      0.786      0.838

timer off 1
timer list 1
#  1:     24.8 /        3 =      8.2650

R Code:

Using the same data and options as the STATA code, we then estimate the adjusted standard errors using our new R function. This requires us to first estimate our regression model using the felm function from the lfe package.

# Loading required packages:
library(foreign)     # read.dta()
library(data.table)
library(lfe)         # felm()
library(dplyr)       # %>% pipe

# Loading sample data:
dta_file <- "~/Dropbox/ConleySEs/Data/new_testspatial.dta"
DTA <- data.table(read.dta(dta_file))
setnames(DTA, c("latitude", "longitude"), c("lat", "lon"))

# Loading R function to compute Conley SEs:
source("~/Dropbox/ConleySEs/ConleySEs_17June2015.R")

ptm <- proc.time()

# We use felm() from the lfe package to estimate the model with year and county fixed effects.
# Two important points:
# (1) We specify our latitude and longitude coordinates as the cluster variables, so that they are included in the output (m).
# (2) We specify keepCX = TRUE, so that the centered data are included in the output (m).

m <- felm(EmpClean00 ~ HDD - 1 | year + FIPS | 0 | lat + lon,
  data = DTA[!is.na(EmpClean00)], keepCX = TRUE)

coefficients(m) %>% round(3) # Same as the STATA result.
   HDD
-0.669

We then feed this model to our function, as well as the cross-sectional unit (county FIPS codes), time unit (year), geo-coordinates (lat and lon), the cutoff for serial correlation (5 years), the cutoff for spatial correlation (500 km), and the number of cores to use.

SE <- ConleySEs(reg = m,
    unit = "FIPS",
    time = "year",
    lat = "lat", lon = "lon",
    dist_fn = "SH", dist_cutoff = 500,
    lag_cutoff = 5,
    cores = 1,
    verbose = FALSE)

sapply(SE, sqrt) %>% round(3) # Same as the STATA results.
        OLS     Spatial Spatial_HAC
      0.608       0.786       0.837
proc.time() - ptm
   user  system elapsed
  1.249   0.015   1.318

Estimating the model and computing the standard errors requires just over 1 second, making it roughly 20 times faster than the comparable STATA routine.


R Using Multiple Cores:

Even with a single core, we realize significant speed improvements. However, the gains are even more dramatic when we employ multiple cores. Using 4 cores, we can cut the estimation of the standard errors down to around 0.4 seconds. (These replications employ the Haversine distance formula, which is more time-consuming to compute.)
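For reference, the Haversine formula converts latitude/longitude pairs into great-circle distances and involves several trigonometric evaluations per pair of points, which is why it is comparatively costly. A minimal sketch (the helper name haversine_km is ours, for illustration only):

haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a))) # r = Earth's radius in km
}

haversine_km(37.4, -122.2, 38.9, -77.0) # Stanford, CA to Washington, DC: roughly 3,900 km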

pkgs <- c("rbenchmark", "lineprof")
invisible(sapply(pkgs, require, character.only = TRUE))

bmark <- benchmark(replications = 25,
  columns = c('replications','elapsed','relative'),
  ConleySEs(reg = m,
    unit = "FIPS", time = "year", lat = "lat", lon = "lon",
    dist_fn = "Haversine", lag_cutoff = 5, cores = 1, verbose = FALSE),
  ConleySEs(reg = m,
    unit = "FIPS", time = "year", lat = "lat", lon = "lon",
    dist_fn = "Haversine", lag_cutoff = 5, cores = 2, verbose = FALSE),
  ConleySEs(reg = m,
    unit = "FIPS", time = "year", lat = "lat", lon = "lon",
    dist_fn = "Haversine", lag_cutoff = 5, cores = 4, verbose = FALSE))
bmark %>% mutate(avg_elapsed = elapsed / replications, cores = c(1, 2, 4))
  replications elapsed relative avg_elapsed cores
1           25   23.48    2.095      0.9390     1
2           25   15.62    1.394      0.6249     2
3           25   11.21    1.000      0.4483     4

Given the prevalence of panel data that exhibits both serial and spatial dependence, we hope this function will be a useful tool for applied econometricians working in R.


Feedback Appreciated: Memory vs. Speed Tradeoff

This was Darin’s first foray into C++, so we welcome feedback on how to improve the code. In particular, we would appreciate thoughts on how to overcome a memory vs. speed tradeoff we encountered. (You can email me at darinc[at]stanford.edu.)

The most computationally intensive chunk of our code computes the distance from each unit to every other unit. To cut down on the number of distance calculations, we can fill the upper triangle of the distance matrix and then copy it to the lower triangle. With \(N\) units, this requires only \(N (N-1) /2\) distance calculations.

However, as the number of units grows, this distance matrix becomes too large to store in memory, especially when executing the code in parallel. (We tried to use a sparse matrix, but this was extremely slow to fill.) To overcome this memory issue, we can avoid constructing a distance matrix altogether. Instead, for each unit, we compute the vector of distances from that unit to every other unit. We then only need to store that vector in memory. While this cuts down on memory use, it requires us to make twice as many (\(N(N-1)\)) distance calculations.
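In schematic R code, the two strategies look like this (with a toy Euclidean distance standing in for the Haversine calculation; the real implementations are the C++ functions XeeXhC and XeeXhC_Lg mentioned below):

set.seed(1)
N <- 5
coords <- matrix(runif(N * 2), ncol = 2)
toy_dist <- function(a, b) sqrt(sum((a - b)^2)) # stand-in for the Haversine calculation

# Strategy 1: fill and store the full N x N matrix -- N(N-1)/2 calculations, O(N^2) memory.
D <- matrix(0, N, N)
for (i in 1:(N - 1)) {
  for (j in (i + 1):N) {
    D[i, j] <- toy_dist(coords[i, ], coords[j, ])
    D[j, i] <- D[i, j] # copy the upper triangle into the lower triangle
  }
}

# Strategy 2: one distance vector per unit -- roughly twice the calculations, O(N) memory.
for (i in 1:N) {
  d_i <- apply(coords, 1, toy_dist, coords[i, ]) # distances from unit i to every unit
  # ...use d_i to accumulate unit i's contribution, then let it be overwritten
}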

As the number of units grows, we are forced to perform more duplicate distance calculations to avoid memory constraints – an unfortunate tradeoff. (See the functions XeeXhC and XeeXhC_Lg in ConleySE.cpp.)


sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] graphics  grDevices utils     datasets  stats     methods
[7] base

other attached packages:
 [1] RcppArmadillo_0.4.600.4.0 Rcpp_0.11.6
 [3] geosphere_1.3-11          sp_1.0-17
 [5] lfe_2.3-1709              Matrix_1.1-5
 [7] knitr_1.9                 data.table_1.9.4
 [9] beepr_1.1                 dplyr_0.4.1
[11] ggplot2_1.0.0             foreign_0.8-62
[13] lubridate_1.3.3

loaded via a namespace (and not attached):
 [1] assertthat_0.1   audio_0.1-5      chron_2.3-45
 [4] colorspace_1.2-4 compiler_3.1.2   DBI_0.3.1
 [7] digest_0.6.8     evaluate_0.5.5   formatR_1.0
[10] Formula_1.2-0    grid_3.1.2       gtable_0.1.2
[13] htmltools_0.2.6  lattice_0.20-29  magrittr_1.5
[16] MASS_7.3-37      memoise_0.2.1    munsell_0.4.2
[19] parallel_3.1.2   plyr_1.8.1       proto_0.3-10
[22] reshape2_1.4.1   rmarkdown_0.5.1  scales_0.2.4
[25] stringi_0.4-1    stringr_1.0.0    tools_3.1.2
[28] xtable_1.7-4     yaml_2.1.13

Faster Two-Way Fixed Effects Estimator in R

In a current project, we need to run linear models on a large data frame (~70 million rows) with fixed effects for both place (e.g., grid-cell) and time (e.g., year) – a common specification for difference-in-differences models with multiple periods and groups. The base lm function in R choked on the task. The excellent lfe package by Gaure can estimate these models but requires long run times.

In an attempt to speed up our estimation routines, I built a new function that uses the data.table and RcppEigen packages to partial out the fixed effects and then run OLS on the de-meaned data. I include the code below in hopes that it might be useful to others facing a similar task. Note: this assumes that your panel is balanced.
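Concretely, for a balanced panel the two-way within transformation replaces each variable with \( \tilde{z}_{it} = z_{it} - \bar{z}_{i\cdot} - \bar{z}_{\cdot t} + \bar{z} \), where \( \bar{z}_{i\cdot} \), \( \bar{z}_{\cdot t} \), and \( \bar{z} \) are the unit, time, and overall means. Regressing \( \tilde{y}_{it} \) on the analogously demeaned covariates then yields the same slope estimates as including the full sets of unit and time dummies.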

To start, we need to create some fake data for analysis:

library(data.table)  # needed for data.table() below

T = 6; N = 10^4
fake_dt <- data.table(
	A = sample(c(0,1), N * T, replace = TRUE),
	B1 = sample(c(0,1), N * T, replace = TRUE),
	B2 = rnorm(N * T),
	year = rep(1980:(1979 + T), N),
	cell = rep(1:100, each = T))  # 100 cells, recycled across the N * T rows

TwoWayFE <- function(y, Xs, fe1, fe2, dt, cluster = FALSE) {
	pkgs <- c("data.table", "RcppEigen")
	sapply(pkgs, require, character.only = TRUE)

	keep_cols <- c(y, Xs, fe1, fe2)
	model_dt <- dt[, keep_cols, with = FALSE]; rm(keep_cols)
	model_dt <- na.omit(model_dt)

	num_Xs <- length(Xs)
	new_names <- c("y", paste0("x", 1:num_Xs), "fe1", "fe2")
	setnames(model_dt, 1:ncol(model_dt), new_names)

	# Sample Means:
	cols <- new_names[!grepl("fe", new_names)]
	model_dt[, paste0("mean_", cols) :=
		lapply(.SD, mean, na.rm = TRUE), .SDcols = cols]

	# Means by FE1:
	setkey(model_dt, fe1)
	model_dt[,
		paste0("mean_", cols, "_fe1") :=
			lapply(.SD, mean, na.rm = TRUE), .SDcols = cols, by = fe1]
	M <- length(unique(model_dt$fe1))
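	# M (and Y below) count the fe1 and fe2 levels; both enter the
	# degrees-of-freedom and cluster corrections at the end of the function.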

	# Means by FE2:
	setkey(model_dt, fe2)
	model_dt[,
		paste0("mean_", cols, "_fe2") :=
		lapply(.SD, mean, na.rm = TRUE), .SDcols = cols, by = fe2]
	Y <- length(unique(model_dt$fe2))

	# Demeaning:
	model_dt[, "y_tilde" := y - mean_y_fe2 - mean_y_fe1 + mean_y]

	g <- function(i) {paste0("x",i,"_tilde")}
	LHS <- sapply(1:num_Xs, g)

	f <- function(i) {
		paste0("x",i," - mean_x",i,"_fe2 - mean_x",i,"_fe1 + mean_x", i)
	}
	RHS <- paste0("list(",paste(sapply(1:num_Xs, f), collapse = ", "), ")")

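	# Evaluate expressions of the form "x1 - mean_x1_fe2 - mean_x1_fe1 + mean_x1"
	# to create the demeaned covariates (x1_tilde, ...) in a single data.table call: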
	model_dt[, eval(LHS) := eval(parse(text = RHS))]

	x_cols <- grep("x\\d{1}_tilde", names(model_dt), value = TRUE)
	model_dt <- model_dt[, c("y_tilde", eval(x_cols), "fe1", "fe2"),
		with = FALSE]

	y <- model_dt$y_tilde
	X <- model_dt[, x_cols, with = FALSE]
	cluster_vec <- model_dt$fe1
	rm(model_dt)

	m <- RcppEigen::fastLm(X, y)
	names(m$coefficients) <- Xs

	##############################################
	# Residual degrees of freedom for the full two-way FE model. fastLm() has
	# already deducted the num_Xs slopes, so we subtract the absorbed fixed
	# effects: (M - 1) + (Y - 1) dummies plus the implicit intercept
	# (the demeaned regression itself has no intercept).
	DoF <- m$df.residual - (M - 1) - (Y - 1) - 1

	# SEs:
	if(cluster){
		N <- length(cluster_vec)
		K <- m$rank + (M - 1) + (Y - 1)  + 1

		dfc <- (M/(M - 1)) * ((N - 1)/(N - K))
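		# CR1 cluster-robust sandwich: sum the score contributions (residual * x)
		# within each fe1 cluster, then combine the bread and meat matrices below.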
		est_fun <- residuals(m) * X

		dt <- data.table(est_fun, fe1 = cluster_vec)

		setkey(dt, fe1)
		dt <- dt[, lapply(.SD, sum), by = fe1][, 2:ncol(dt), with = FALSE]

		bread <- solve(crossprod(as.matrix(X))) * N
		meat <- crossprod(as.matrix(dt)) / N

		m$se <- as.numeric(sqrt(dfc * 1/N * diag(bread %*% meat %*% bread)))
		message("SEs Clustered on First FE")
	} else {
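		# Rescale fastLm's classical SEs so they use the corrected residual
		# degrees of freedom (DoF) rather than fastLm's own df.residual: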
		m$se <- m$se * sqrt(m$df.residual / DoF)
		message("SEs Not Clustered")
	}

	# Correcting degrees of freedom:
	m$df.residual <- DoF

	return(m)
}

I then check equivalence of the estimates and benchmark this function against both the felm and lm functions.

# CHECKING EQUIVALENCE:
options(digits = 10)

pkgs <- c("RcppEigen", "data.table", "lfe", "testthat", "microbenchmark")
sapply(pkgs, require, character.only = TRUE)

t_dc <- system.time(
m <- TwoWayFE(y = "A", Xs = c("B1", "B2"),
		fe1 = "cell", fe2 = "year",
		dt = fake_dt,
		cluster = FALSE))

library(lfe)
t_felm <- system.time(
m_felm <- felm(A ~ B1 + B2 | factor(cell) + factor(year),
	data = fake_dt))

t_base <- system.time(
m_base <- lm(A ~ B1 + B2 + factor(cell) + factor(year),
	data = fake_dt))

# Comparing Results:
summary(m)$coefficients[1:2, 1:2]
#           Estimate     Std. Error
# B1 -0.002497459836 0.004086039606
# B2  0.001355440208 0.002022403917

summary(m_felm)$coefficients[1:2, 1:2]
#           Estimate     Std. Error
# B1 -0.002497459836 0.004086039606
# B2  0.001355440208 0.002022403917

summary(m_base)$coefficients[2:3, 1:2]
#           Estimate     Std. Error
# B1 -0.002497459836 0.004086039606
# B2  0.001355440208 0.002022403917

# Checking equivalence:
b1_dc <- summary(m)$coefficients[1,1]
b1_felm <- summary(m_felm)$coefficients[1,1]
expect_equivalent(b1_dc, b1_felm)

b1_lm <- summary(m_base)$coefficients[1,1]
expect_equivalent(b1_dc, b1_lm)

Here’s the punchline:

Benchmark (boxplot of estimation times, produced by the ggplot code below):

# BENCHMARKING:

# More time-consuming:
op <- microbenchmark(
	DC = TwoWayFE(y = "A", Xs = c("B1", "B2"), fe1 = "cell", fe2 = "year",
		dt = fake_dt, cluster = FALSE),
	FELM = felm(A ~ B1 + B2 | factor(cell) + factor(year), data = fake_dt),
	BASE = lm(A ~ B1 + B2 + factor(cell) + factor(year), data = fake_dt),
	times = 100L)

print(op)
# Unit: milliseconds
#  expr        min         mean      median        max neval
#    DC  44.697352  63.51898258  54.3942015 162.108703   100
#  FELM 502.789730 585.82423354 576.5496225 771.906130   100
#  BASE 696.929222 758.06220078 737.3307755 939.178530   100

ggplot(data = op, aes(x = expr, y = log(time))) +
	geom_boxplot() +
	xlab("Function") +
	ylab("Log(time)") +
	theme_bw()

Introduction

I am a Ph.D. candidate in political science at Stanford University. My dissertation focuses on the consequences of foreign direct investment, especially in commercial mining, for conflict and political development in sub-Saharan Africa.

I do my best to keep this site updated with information about my current and past research projects. Please don’t hesitate to email me with any questions (darinc[at]stanford.edu).