What I'm learning from "Using Big Data to Solve Economic and Social Problems"

Recently I’ve been auditing a Harvard course by Raj Chetty, “Using Big Data to Solve Economic and Social Problems”, freely available here.

It’s given me a lot to think about, both in the content of the lectures and in how Chetty organizes the course and delivers the material.

Motivation

It’s the kind of economics class I think everyone should start with, and it addresses the biggest complaint about the subject I’ve heard from everyone from new students to former econ majors like myself: too much theory and too little connection to the real world. Chetty jumps straight to the impact and motivates the methods and theory of the course by offering them as tools for tackling the biggest problems of our time: economic inequality and mobility, climate change, health, and so on.

Hooking students with the fun part first reminds me of Hadley Wickham and Garrett Grolemund’s approach to teaching data science. They start with data exploration, jumping right into visualization with ggplot2 to pique students’ interest and build motivation for the rest of the course. David Robinson summarizes the merits of this approach in this great piece.

Big data

When I first started the course, I was unsure about Chetty’s use of the “big data” buzzword. In industry it’s often more of a marketing term than anything substantive. But his use of the term is instructive. He explains how the advent of broad, digitized datasets has enabled researchers to ask questions in new ways. In particular, I appreciate the way he pins down what the term actually buys researchers.

Chetty focuses on how big data provide two benefits:

  1. high precision due to large sample sizes
  2. the ability to zoom in on subgroup-specific effects and interaction effects
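
As a back-of-the-envelope illustration of both points (simulated data I made up, not anything from the course), the standard error of a subgroup mean scales with 1/sqrt(n), so even a 5% subgroup of a million-row dataset is estimated precisely:

```r
library(dplyr)

set.seed(42)
n <- 1e6  # "big data": a million observations

sim <- tibble(
  subgroup = sample(c("A", "B", "C", "D"), n, replace = TRUE,
                    prob = c(0.70, 0.15, 0.10, 0.05)),
  outcome  = rnorm(n, mean = 50, sd = 10)
)

# Even the smallest subgroup (~50,000 rows) has a tiny standard error,
# since SE = sd / sqrt(n_subgroup).
sim %>%
  group_by(subgroup) %>%
  summarise(
    n_obs = n(),
    avg   = mean(outcome),
    se    = sd(outcome) / sqrt(n_obs)
  )
```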

Given that most of the data in the course is observational, the course also gently introduces causal inference. Rather than provide an introduction to the field as a whole, it proceeds method by method (e.g., regression discontinuity, instrumental variables), focusing on application, intuition, and results rather than implementation details. The implementation details are absolutely important, but grounding them in real applications makes the drier material much more palatable.
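
To give a flavor of what these quasi-experimental designs look like in practice, here is a minimal sharp regression discontinuity sketch on simulated data (the 50-point cutoff, the effect size, and all variable names are invented for illustration; none of this comes from the course materials):

```r
library(dplyr)
library(ggplot2)

set.seed(1)

# Simulated setup: a program is offered to everyone scoring at or above 50
# on some running variable, and we want the program's effect on an outcome.
rd <- tibble(
  score   = runif(5000, 0, 100),
  treated = score >= 50,
  outcome = 0.03 * score + 2 * treated + rnorm(5000)
)

# Sharp RD in its simplest form: regress the outcome on treatment status,
# controlling linearly for the running variable centered at the cutoff.
fit <- lm(outcome ~ treated + I(score - 50), data = rd)
summary(fit)$coefficients["treatedTRUE", ]  # estimated jump at the cutoff (~2)

# Visual check: the discontinuity at the cutoff is the estimated effect.
ggplot(rd, aes(score, outcome, colour = treated)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm") +
  geom_vline(xintercept = 50, linetype = "dashed")
```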

On the larger point of correlation vs. causation, Chetty walks the student through the researcher’s thought process of building a body of evidence: identifying factors that may contribute to an outcome of interest, then designing additional experiments or quasi-experiments to identify their effects. This was a refreshing take on the usual “correlation ≠ causation” warning, which often leaves students confused about how to move forward.

What’s coming

As I work through the course, I’ll be creating a gitbook here with my notes from the lectures and from exploring the data made available by the Opportunity Insights team. I was inspired by this fantastic gitbook for ISLR, which has chapter notes and solutions to the exercises using the tidyverse.

Brief notes on some key things the course has highlighted (I’ll expand on these in the gitbook over the next few weeks):

  - identification: how to tell whether the quantities you’re observing in the data actually relate to the phenomena of interest
  - intuition for how to use the methods in practice (I see promise for using list-cols and purrr for this; see the sketch below)
  - effects only need to hold on average (i.e., the error term is uncorrelated with the variable of interest), not for every individual
  - how to interpret statistically insignificant results (i.e., we don’t have the precision or amount of data needed to make a statement, not that there is no effect)
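
As a rough first pass at the list-cols idea above, the pattern I have in mind looks something like this (made-up data and a throwaway model, just to show the nest-then-map workflow, not anything from the Opportunity Insights data):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(7)

# Made-up data: an outcome, a predictor, and a grouping variable whose
# subgroup-specific slopes we want to estimate separately.
df <- tibble(
  group = rep(c("low_income", "mid_income", "high_income"), each = 1000),
  x     = rnorm(3000),
  y     = rep(c(0.1, 0.3, 0.5), each = 1000) * x + rnorm(3000)
)

# Nest the data by group, map the same model over each piece, and tidy the
# results -- one model per subgroup, all kept in a single data frame.
by_group <- df %>%
  group_by(group) %>%
  nest() %>%
  mutate(
    model = map(data, ~ lm(y ~ x, data = .x)),
    coefs = map(model, broom::tidy)
  )

by_group %>%
  select(group, coefs) %>%
  unnest(coefs)
```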

More broadly, I’m going to spend some time thinking about how “big data” and standard econometric techniques relate. To start, I’ll be re-reading Hal Varian’s article “Big Data: New Tricks for Econometrics”.


