Spring 2016 - 60560 - PA397C - Advanced Empirical Methods for Policy Analysis

Statistical Analysis and Learning

Large datasets are increasingly available across many sectors, such as healthcare, energy, and online markets. This course focuses on methods for “mining” such datasets to uncover underlying relationships and patterns in the data. The course starts with a review of basic statistical concepts and linear regression, but it focuses mostly on classification and clustering using non-regression techniques such as tree-based approaches, support vector machines, and unsupervised learning. The course also includes a brief introduction to the vast field of Probabilistic Graphical Models (PGM). In the problem sets and tutorials we will examine policy applications in healthcare, energy, transportation, online markets, and patent systems.

This course is intended for first- and second-year Master's students. Ph.D. students with an interest in non-regression-based quantitative methods may also find this course useful.

Topics to be covered include: Review of statistical concepts; Linear regression; Classification (Linear Discriminant Analysis and Quadratic Discriminant Analysis); Resampling and validation (cross-validation, bootstrap); Tree-based approaches (decision trees, bagging, random forests); Support vector machines (classification using hyperplanes); Unsupervised learning (principal components analysis, K-means clustering).

Textbook: An Introduction to Statistical Learning, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. A PDF of the text and all associated datasets are available at:

Additional readings: Relevant material from the textbook for each unit will be augmented with 2-3 additional readings, typically highly accessible and relevant journal articles selected by the instructor.